-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathnda_hw_2_riggins.Rmd
226 lines (196 loc) · 8.03 KB
/
nda_hw_2_riggins.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
---
title: "Homework 2 for Network Data Analysis"
author: "Daniel P. Hall Riggins, MD"
date: "4/18/2022"
bibliography: network_citations.bib
output: html_document
---
## Prep
Load needed dependencies:
```{r echo=FALSE}
library(tidyverse)
library(tidygraph)
library(ggraph)
library(targets)
```
Ensure data pipeline is fresh and load it:
```{r}
targets::tar_make()
targets::tar_load(MUC_huttlin_graph)
```
The graph I'll use in this assignment was derived by starting with the graph of human protein interactions described by @huttlinDualProteomescaleNetworks2021 then finding its intersection of nodes with a list of piglet orthologs described by @parkComparativeProteomicAnalysis2009. Each protein in Park's list was differentially expressed between piglets with malformed umbilical cords (MUC) and piglets without. See my [github repository](https://github.com/andtheWings/network_data_analysis/blob/main/R/wrangling_shared_MUC_orthologs.R), to view source code for how this graph was created.
Each node in the graph represents a specific protein from a specific cell line. The cell line 293T comes from human embryonic kidney cells and the cell line HCT116 comes from human colon cancer cells. The edges represent macromolecular protein interactions as identified by affinity purification. The researchers used a computational classifier called ComPASS-Plus, which assigns a score to distinguish high-confidence interacting proteins with high specificity (score of >0.75). Our visualization will incorporate cell line origin and ComPASS-Plus score.
## Data Wrangling
To make the ComPASS-score more useful to the layout engine in igraph, we will rescale it to a range of 1 to 10:
```{r}
MUC_huttlin_graph <-
MUC_huttlin_graph |>
activate("edges") |>
mutate(
rescaled_score =
scales::rescale(
as.numeric(score),
to = c(1, 10)
)
)
```
To convince ourselves that the distribution of scores has been preserved, we'll plot the original scores:
```{r}
DataExplorer::plot_histogram(as.numeric(as_tibble(MUC_huttlin_graph)$score))
```
And rescaled scores:
```{r}
DataExplorer::plot_histogram(as_tibble(MUC_huttlin_graph)$rescaled_score)
```
## Plot the Graph
```{r}
ggraph(
MUC_huttlin_graph,
layout='igraph',
# Use Fruchterman-Reingold layout algorithm (force-directed)
algorithm='fr',
# Weight edge distance by the rescaled ComPASS-Plus score
weights = rescaled_score
) +
geom_edge_link(
# Color edges by the rescaled ComPASS-Plus score
aes(color = rescaled_score),
width = 0.5,
# Transparency
alpha = 0.6,
# Arrow specifications
arrow = arrow(length = unit(1, 'mm')),
end_cap = circle(0.5, 'mm')
) +
# Color palette and label for edges in the legend
scale_edge_color_viridis(
name = "Rescaled ComPASS-Plus",
option = "A",
direction = -1
) +
geom_node_point(
# Color nodes by cell line type
aes(color = cell_line),
size = 0.6,
alpha = 0.6
) +
# Color palette and label for edges in the legend
scale_color_manual(
name = "Cell Line",
# There's a package of color palettes inspired by Wes Anderson movies!
values = wesanderson::wes_palette(
name = "GrandBudapest2",
n = 2,
type = "discrete"
)
) +
# Use the generic graph theme
theme_graph() +
# Add a title
labs(
title = "Human proteins whose pig orthologs are differentially \n expressed in the malformed umbilical cord phenotype"
)
```
## Discussion
### Edge Weighting
One subtlety that is hard to appreciate in the graph's totality is how edge distance is weighted by ComPASS-Plus score such that proteins with higher confidence of true interaction are attracted slightly closer together. To illustrate, we'll visualize just a subset of the graph both with a weighted and unweighted layout.
```{r}
# Create a sub-graph just with proteins that interact with 293T_HBB
`293T_HBB_graph` <-
convert(
MUC_huttlin_graph,
to_local_neighborhood,
node = which(.N()$cell_line_and_official_symbol == "293T_HBB"),
order = 1,
mode = "all"
)
# Graph it without weights
ggraph(
`293T_HBB_graph`,
layout='igraph',
algorithm='fr'
) +
geom_edge_link(
aes(color = rescaled_score),
width = 0.5,
alpha = 0.75,
arrow = arrow(length = unit(2, 'mm')),
end_cap = circle(2.5, 'mm')
) +
scale_edge_color_viridis(
name = "Rescaled ComPASS-Plus",
option = "A",
direction = -1
) +
# Just for fun, change the nodes to have text labels
geom_node_label(
aes(
color = cell_line,
label = official_symbol
),
size = 2,
label.padding = unit(0.15, "lines"),
alpha = 0.8
) +
scale_color_manual(
name = "Cell Line",
values = wesanderson::wes_palette(
name = "GrandBudapest2",
n = 2,
type = "discrete"
)
) +
theme_graph() +
labs(
title = "Unweighted protein interactions for 293T_HBB"
)
```
```{r}
# Graph it with weights
ggraph(
`293T_HBB_graph`,
layout='igraph',
algorithm='fr',
weights = rescaled_score
) +
geom_edge_link(
aes(color = rescaled_score),
width = 0.5,
alpha = 0.75,
arrow = arrow(length = unit(2, 'mm')),
end_cap = circle(2.75, 'mm')
) +
scale_edge_color_viridis(
name = "Rescaled ComPASS-Plus",
option = "A",
direction = -1
) +
geom_node_label(
aes(
color = cell_line,
label = official_symbol
),
size = 2,
label.padding = unit(0.15, "lines"),
alpha = 0.8
) +
scale_color_manual(
name = "Cell Line",
values = wesanderson::wes_palette(
name = "GrandBudapest2",
n = 2,
type = "discrete"
)
) +
theme_graph() +
labs(
title = "Weighted protein interactions for 293T_HBB"
)
```
These sub-graph plots also make it easier to see how are arrows are directed from a bait protein to a prey protein.
### Missing Data:
When considering missing interactions for each of the cell lines, one must consider bait **and** prey coverage. As of publication in Cell, Huttlin et al. (2021) had tested in 293T every validated human protein from the [hORFeome V8.1 Library ](https://www.broadinstitute.org/scientific-community/science/platforms/gpp/horfeome-v81-library) as bait, while they had only tested about half of the hORFeome library in the HCT116 cell line. A full accounting of baits used and interactions identified in each cell line can be obtained from Supplemental Table 1. The prey coverage is more scant at 70% of the bait set in 293T and 52% in HCT116. A full accounting of each replicate of bait-prey pairing in each cell line can be obtained from Supplemental Table 2.
### Errors:
@chiangCoverageErrorModels2007 outline three types of potential error in assays of protein-protein interactions, these being related to coverage, systematic measurement error, and stochastic measurement error. The data from Huttlin et al. (2021) is vulnerable to all three types--the first as described in the previous section. One issue of systematic measurement error is how some proteins are more apt to form complexes when in the bait role versus the prey role. Another source of systematic measurement error is how the probability of detecting a protein interaction is affected by relative protein abundance in the medium. Notably, the ComPASS-Plus system of scoring *does* makes adjustments based on relative protein abundance. Finally, in order to reduce stochastic error, which is random, the authors tried to replicate bait-prey pairings as much as possible.
The authors tried to account for some other sources of error too. One was how a protein will elute in different size-based fractions depending on whether it has formed into a protein complex vs a one-to-one interaction. Another was how the prior expectation for an interaction to be present is dependent on the type of cell line used plus how much it has already been characterized. Another consideration is how the rate of stochastic error can be modulated by the sequence of filtering steps used in scoring systems like ComPASS-Plus and the hyperparameters selected for each filtering step.
# Citations {-}