---
title: "gDR suite"
author: "gDR team"
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{Running the drug response processing pipeline}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
# Introduction {#intro}
Over decades, many departments across gRED and Roche have generated large amounts of drug response screening data using Genentech's rich inventory of drug compounds. Although extensive labor and time have been invested in generating these data, they have not been analyzed in a standardized manner that allows meaningful comparison. On the one hand, large screens are performed across many cell lines and drugs in a semi-automated manner. On the other hand, small-scale studies, which focus on factors that contribute to sensitivity and resistance to certain therapies, are generally performed by individual scientists with limited automation. These approaches are complementary, but their data have rarely been handled in the same way. Commercial software is available for analyzing large datasets, whereas researchers working with small-scale datasets often process data ad hoc with software such as PRISM.

Here, we propose a suite of computational tools that enables the processing, archiving, and visualization of drug response data from any experiment, regardless of size or experimental design. This ensures reproducibility and supports the Findable, Accessible, Interoperable, and Reusable (F.A.I.R.) principles, with the goal of making the suite accessible to the public.

For now, we share a subset of the gDR suite components for pre-processing and processing the data.

## R Packages {#rpackages}

The gDR suite consists of several packages that power our app and make it a comprehensive tool.
All packages under the gDR umbrella are stored in the [gDR platform GitHub organization](https://github.roche.com/gdrplatform/).

We are happy to share our packages for importing, processing, and managing gDR data:

- gDRimport
- gDRcore
- gDRutils
- gDRtestData

## Data structures

The gDR data model is based on the SummarizedExperiment and BumpyMatrix classes. Readers unfamiliar with these data models should first read the [SummarizedExperiment vignettes](https://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html), followed by the [BumpyMatrix vignettes](https://www.bioconductor.org/packages/devel/bioc/vignettes/BumpyMatrix/inst/doc/BumpyMatrix.html). The SummarizedExperiment structure makes it easy to subset the data and to correlate drug response data with genomic data, as these data may be jointly stored in a MultiAssayExperiment. The BumpyMatrix allows for storage of multi-dimensional data while retaining a matrix abstraction.

All downstream processing functions and visualization tools operate on this core data structure.
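
To make the nesting idea concrete, here is a minimal sketch that builds a tiny BumpyMatrix and stores it as an assay of a SummarizedExperiment. The measurements, the column names, and the assay name `toy_assay` are hypothetical and chosen only for illustration; this is not the layout that gDR itself produces.

```r
library(S4Vectors)
library(BumpyMatrix)
library(SummarizedExperiment)

# Hypothetical toy measurements: two drugs tested on one cell line.
measurements <- DataFrame(
  Concentration = c(0.1, 1, 10, 0.1, 1, 10),
  ReadoutValue = c(980, 610, 120, 950, 800, 400)
)
drug <- rep(c("drug_A", "drug_B"), each = 3)
cell_line <- rep("cell_line_1", 6)

# Nest the measurements: one DataFrame per (drug, cell line) combination.
nested <- splitAsBumpyMatrix(measurements, row = drug, column = cell_line)
dim(nested)

# The BumpyMatrix behaves like a matrix, so it can be used as an assay.
se_toy <- SummarizedExperiment(assays = list(toy_assay = nested))
assayNames(se_toy)
```

Each cell of `nested` holds a small table of concentration and readout pairs, while the object as a whole still has rows (drugs) and columns (cell lines).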

## Overview

The gDR suite was designed in a modular manner, so that a user can enter the "standard" end-to-end gDR processing pipeline at several entry points, as suits their needs. The full pipeline involves:
```
manifest, template(s), raw data
              |
              |   1. Aggregating all raw data and metadata
              |      into a single long table.
              |
              V
      single, long table
              |
              |   2. Transforming the long table into
              |      a SummarizedExperiment object with BumpyMatrix assays
              |      by specifying what columns belong on rows,
              |      columns, and nested.
              |
              V
  SummarizedExperiment object
  with raw and treated assays
              |
              |   3. Normalizing, averaging, and fitting data.
              |
              V
  SummarizedExperiment object
  with raw, treated, normalized,
  averaged, and metric assays,
  ready for use by downstream visualization
```
A user should be able to enter any part of this pipeline as long as they are able to create the intermediate object (i.e., the individual manifest, template, and raw data files; a single, long table; or a SummarizedExperiment object with Bumpy assays).

# Quick start

## Aggregating raw data and metadata (1)

The gDR suite ultimately requires a single, long merged table containing both raw data and metadata.
To support a common use case, we provide a convenience function that takes three inputs (manifest, template(s), and raw data) and creates this single, long merged table for the user. The manifest contains metadata on the experimental design, the template files specify the drugs and cell lines used, and the raw data files contain the output obtained from a plate reader or a scanner.
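
As a conceptual illustration of this aggregation step, the sketch below joins three toy tables with base R `merge`. It does not show how the convenience function is implemented, and every column name and value in it is hypothetical rather than part of the schema gDR expects.

```r
# Hypothetical manifest: one row per plate, pointing at the layout it uses.
manifest <- data.frame(
  Barcode = c("plate_1", "plate_2"),
  Template = c("template_A", "template_A"),
  Duration = c(72, 72)
)

# Hypothetical template: what was dispensed into each well of that layout.
template <- data.frame(
  Template = "template_A",
  Well = c("A1", "A2", "A3"),
  Drug = c("vehicle", "drug_X", "drug_X"),
  Concentration = c(0, 0.1, 1)
)

# Hypothetical raw data: one measured value per plate and well.
raw_data <- data.frame(
  Barcode = rep(c("plate_1", "plate_2"), each = 3),
  Well = rep(c("A1", "A2", "A3"), times = 2),
  ReadoutValue = c(1000, 800, 300, 980, 790, 310)
)

# Attach the plate layout to every plate listed in the manifest ...
design <- merge(manifest, template, by = "Template")
# ... then attach the measured value for each plate/well combination.
long_table <- merge(design, raw_data, by = c("Barcode", "Well"))
long_table
```

The result has one row per measured well, with the experimental metadata repeated alongside each readout, which is the general shape the rest of the pipeline works from.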
Example data can be retrieved as follows:
```{r, echo=TRUE, results='asis'}
library(gDR)
# get test data from gDRimport package
# i.e. paths to manifest, templates and results files
td <- get_test_data()
manifest_path(td)
template_path(td)
result_path(td)
```
Using the convenience function `import_data`, the long table is easily created:
```{r, echo=TRUE, warning=FALSE, results='hide', message=FALSE}
# Import data
imported_data <-
  import_data(manifest_path(td), template_path(td), result_path(td))
head(imported_data)
```
This function expects certain "identifiers" that tell the processing functions which columns in the long table map to which expected fields, so that each column is interpreted correctly. For more details on these identifiers, see the "Details" section of `?identifiers`. Use `set_env_identifier` or `set_SE_identifiers` to set up the correct mappings between the expected fields and the column names of your long table.
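
A minimal sketch of adjusting one identifier is shown below. It assumes the gDRutils functions `get_env_identifiers`, `set_env_identifier` (called as key, then value), and `reset_env_identifiers`, and that `"drug"` is a valid identifier key; `MyDrugColumn` is a hypothetical column name. Check `?identifiers` for the keys and signatures available in your installed version.

```r
library(gDRutils)

# Inspect the column name currently expected for the drug field.
get_env_identifiers("drug")

# Map the drug field to a hypothetical column name used in your own long table.
set_env_identifier("drug", "MyDrugColumn")
get_env_identifiers("drug")

# Restore the default mappings when done.
reset_env_identifiers()
```

Resetting at the end avoids carrying a custom mapping over into unrelated analyses in the same R session.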

## Transforming data into a SummarizedExperiment (2)

Next, we can transform the long table into our initial SummarizedExperiment object.
To do so, we need to tell the software:

- What should go on rows and columns versus be nested in the assay.
- Which rows in our table to consider as "control" versus "treated" for normalization.
- Which data type should be converted into a SummarizedExperiment.

We can do so by setting the `untreated_tag` identifier, for example `set_env_identifier("untreated_tag", c("MY_CONTROL_TERMINOLOGY_HERE"))`, by specifying the nested columns through the arguments of `create_and_normalize_SE` (the chunk below passes `nested_confounders`), and by specifying `data_type`.
```{r}
inl <- prepare_input(imported_data)
detected_data_types <- names(inl$exps)
detected_data_types
se <- create_and_normalize_SE(
  inl$df_list[["single-agent"]],
  data_type = "single-agent",
  nested_confounders = inl$nested_confounders)
se
```
Note that this has created a SummarizedExperiment object with `rowData`, `colData`, `metadata` and 3 `assays`.
## Averaging and fitting data (3)
Next, we can average and fit the data of interest.
```{r, echo=TRUE, warning=FALSE, results='hide', message=FALSE}
se <- average_SE(se, data_type = "single-agent")
se <- fit_SE(se, data_type = "single-agent")
```
```{r, echo=TRUE}
se
```
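
To give an intuition for what normalizing against controls and averaging replicates mean, here is a deliberately simplified base R sketch on a toy table. It is not the gDR implementation, which operates on the assays of the SummarizedExperiment; the column names, the `"vehicle"` control label, and the numbers are hypothetical, and curve fitting is left out.

```r
# Hypothetical toy table: two control wells and two replicates per concentration.
toy <- data.frame(
  Drug = c("vehicle", "vehicle", rep("drug_X", 4)),
  Concentration = c(0, 0, 0.1, 0.1, 1, 1),
  ReadoutValue = c(1000, 1040, 820, 800, 310, 290)
)

# Split control rows from treated rows using the control label.
is_control <- toy$Drug == "vehicle"
control_mean <- mean(toy$ReadoutValue[is_control])

# Normalize: express each treated readout relative to the mean control readout.
treated <- toy[!is_control, ]
treated$RelativeViability <- treated$ReadoutValue / control_mean

# Average: collapse replicates measured at the same drug and concentration.
averaged <- aggregate(
  RelativeViability ~ Drug + Concentration,
  data = treated,
  FUN = mean
)
averaged
```

A dose-response model would then be fitted to the averaged values across concentrations, which is the role `fit_SE` plays in the pipeline above.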
## runDrugResponseProcessingPipeline
Steps (2) and (3) can be combined into a single step through a convenience function: `runDrugResponseProcessingPipeline`. Its output is a `MultiAssayExperiment` object with one experiment for each detected data type. Currently, four data types are supported: 'single-agent', 'cotreatment', 'codilution', and 'matrix'. The first three data types are processed via the 'single-agent' model, while the 'matrix' data is processed via the 'combination' model.
```{r, echo=TRUE, warning=FALSE, results='hide', message=FALSE}
# Run gDR pipeline
mae <- runDrugResponseProcessingPipeline(imported_data)
```
```{r, echo=TRUE}
mae
```
Note that our final MultiAssayExperiment object can be made up of multiple experiments with multiple assays:
```{r, echo=TRUE}
names(mae)
SummarizedExperiment::assayNames(mae[[1]])
```
Each assay from each experiment can be easily converted to `data.table` format using the `convert_se_assay_to_dt` function:
```{r, echo=TRUE}
library(kableExtra)
se <- mae[["single-agent"]]
head(convert_se_assay_to_dt(se, "Metrics"))
```
# Appendix
Once the data is stored in the database, there are multiple ways to visualize it, depending on the scientific needs. The primary method for doing so is our RShiny visualization tool, 'gDRviz'. Here, users can search and select experiments present in the database, and use downstream visualization modules to look at dose response curves, heatmaps, etc.
# SessionInfo {-}
```{r sessionInfo}
sessionInfo()
```