forked from YuLab-SMU/treedata-book
-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy path07_ggtree_tree_with_data.Rmd
298 lines (201 loc) · 18.6 KB
/
07_ggtree_tree_with_data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
# Plotting tree with data {#chapter7}
```{r include=F}
library(knitr)
opts_chunk$set(message=FALSE, warning=FALSE, eval=TRUE, echo=TRUE, cache=TRUE)
```
Integrating user data to annotate phylogenetic tree can be done at different
levels. The `r Biocpkg("treeio")` package implements `full_join` methods to
[combine tree data to phylogenetic tree object](https://bioconductor.org/packages/devel/bioc/vignettes/treeio/inst/doc/Importer.html).
The `r CRANpkg("tidytree")` package supports [linking tree data to phylogeny
using tidyverse verbs](https://cran.r-project.org/web/packages/tidytree/vignette/tiytree.html).
`r Biocpkg("ggtree")` supports mapping external data to phylogeny for
visualization and annotation on the fly. Although the feature of linking external data is overlapping among these packages, they have different application scopes. For example, in addition to the `treedata` object, `r Biocpkg("ggtree")` also supports several other tree objects (see [chapter 9](#chapter9)), including `phylo4d`, `phyloseq` and `obkData` that were designed to contain domain specific data. The design of these objects did not consider to support linking external data to the object (it can not be done at tree object level). We can visualize trees from these objects using `r Biocpkg("ggtree")` and link external data at visualization level [@yu_two_2018].
`r Biocpkg("ggtree")` provides two general methods for mapping and visualizing associated external data on phylogenies. [Method 1](#attach-operator) allows external data to be mapped on the tree structure and used as visual characteristic in tree and data visualization. [Method 2](#facet_plot) plots the data with the tree side by side using different geometric functions after reordering the data based on the tree structure. These two methods integrate data with phylogeny for further exploration and comparison in the evolutionary biology context.
## Mapping data to the tree structure {#attach-operator}
In `r Biocpkg("ggtree")`, we implemented an operator, `%<+%`, to attach annotation data to a `ggtree` graphic object. Any data that contains a column of "node" or first column of taxa labels can be integrated using the `%<+%` operator. Multiple datasets can be attached progressively. When the data are attached, all the information stored in the data serve as numerical/categorical node attributes and can be directly used to visualize the tree by scaling the attributes as different colors or line sizes, label the tree using the original values of the attributes or parsing them as [math expression](#faq-formatting-label), [emoji](#phylomoji) or [silhouette image](#ggimage). The following example uses the `%<+%` operator to integrat taxon (tip\_data.csv) and internal node (inode\_data.csv) information and map the data to different colors or shapes of symbolic points and labels (Figure \@ref(fig:attacher)). The tip data contains `imageURL` that links to online figures of the species, which can be parsed and used as tip labels in `r Biocpkg("ggtree")` (see [chapter 8](#chapter8)).
(ref:attacherscap) Example of attaching multiple datasets.
(ref:attachercap) **Example of attaching multiple datasets**.
```{r attacher, fig.width=9.5, fig.height=6.2, warning=FALSE, message=FALSE, fig.cap="(ref:attachercap)", fig.scap="(ref:attacherscap)"}
library(ggimage)
library(ggtree)
url <- paste0("https://raw.githubusercontent.com/TreeViz/",
"metastyle/master/design/viz_targets_exercise/")
x <- read.tree(paste0(url, "tree_boots.nwk"))
info <- read.csv(paste0(url, "tip_data.csv"))
p <- ggtree(x) %<+% info + xlim(-.1, 4)
p2 <- p + geom_tiplab(offset = .6, hjust = .5) +
geom_tippoint(aes(shape = trophic_habit, color = trophic_habit, size = mass_in_kg)) +
theme(legend.position = "right") + scale_size_continuous(range = c(3, 10))
d2 <- read.csv(paste0(url, "inode_data.csv"))
p2 %<+% d2 + geom_label(aes(label = vernacularName.y, fill = posterior)) +
scale_fill_gradientn(colors = RColorBrewer::brewer.pal(3, "YlGnBu"))
```
Although the data integrated by the `%<+%` operator in `r Biocpkg("ggtree")` is for tree visualization, the data attached to the `ggtree` graphic object can be converted to `treedata` object that contains the tree and the attached data (see [session 7.5](#ggtree_object)).
## Alignning graph to the tree based on tree structure {#facet_plot}
For associating phylogenetic tree with different type of plot produced by user's data, `ggtree` provides `geom_facet` layer and `facet_plot` function which accept an input `data.frame` and a `geom` function to draw the input data. The data will be displayed in an additional panel of the plot. The `geom_facet` (or `facet_plot`) is a general solution for linking graphic layer to a tree. The function internally re-orders the input data based on the tree strucutre and visualizes the data at the specific panel by the geometric layer. Users are free to visualize several panels to plot different types of data as demonstrated in Figure \@ref(fig:phyloseq) and to use different geometric layers to plot the same dataset (Figure \@ref(fig:jv2017)) or different datasets on the same panel.
The `geom_facet` is designed to work with most of the `geom` layers defined in `r CRANpkg("ggplot2")` and other `r CRANpkg("ggplot2")`-based packages. A list of the geometric layers that work seamlessly with `geom_facet` and `facet_plot` can be found in Table \@ref(tab:facet-geom). As the `r CRANpkg("ggplot2")` community keeps expanding and more `geom` layers will be implemented in either `r CRANpkg("ggplot2")` or other extensions, `geom_facet` and `facet_plot` will gain more power to present data in future. Note that different `geom` layers can be combined to present data on the same panel and the combinations of different `geom` layers create the possibility to present more complex data with phylogeny.
(ref:plottreescap) Example of plotting SNP and trait data.
(ref:plottreecap) **Example of plotting SNP and trait data**.
```{r plottree, fig.width=12, fig.height=7, message=F, fig.cap="(ref:plottreecap)", fig.scap="(ref:plottreescap)"}
library(ggtree)
remote_folder <- paste0("https://raw.githubusercontent.com/katholt/",
"plotTree/master/tree_example_april2015/")
## read the phylogenetic tree
tree <- read.tree(paste0(remote_folder, "tree.nwk"))
## read the sampling information data set
info <- read.csv(paste0(remote_folder,"info.csv"))
## read and process the allele table
snps<-read.csv(paste0(remote_folder, "alleles.csv"), header = F,
row.names = 1, stringsAsFactor = F)
snps_strainCols <- snps[1,]
snps<-snps[-1,] # drop strain names
colnames(snps) <- snps_strainCols
gapChar <- "?"
snp <- t(snps)
lsnp <- apply(snp, 1, function(x) {
x != snp[1,] & x != gapChar & snp[1,] != gapChar
})
lsnp <- as.data.frame(lsnp)
lsnp$pos <- as.numeric(rownames(lsnp))
lsnp <- tidyr::gather(lsnp, name, value, -pos)
snp_data <- lsnp[lsnp$value, c("name", "pos")]
## read the trait data
bar_data <- read.csv(paste0(remote_folder, "bar.csv"))
## visualize the tree
p <- ggtree(tree)
## attach the sampling information data set
## and add symbols colored by location
p <- p %<+% info + geom_tippoint(aes(color=location))
## visualize SNP and Trait data using dot and bar charts,
## and align them based on tree structure
p + geom_facet(panel = "SNP", data = snp_data, geom = geom_point,
mapping=aes(x = pos, color = location), shape = '|') +
geom_facet(panel = "Trait", data = bar_data, geom = ggstance::geom_barh,
aes(x = dummy_bar_value, color = location, fill = location),
stat = "identity", width = .6) +
theme_tree2(legend.position=c(.05, .85))
```
## Visualize tree with associated matrix {#gheatmap}
The `gheatmap` function is designed to visualize phylogenetic tree with heatmap of associated matrix (either numerical or categorical). `geom_facet` is a general solution for plotting data with the tree, including heatmap. `gheatmap` is specifically designed for plotting heatmap with tree and provides shortcut for handling column labels and color palette. Another difference is that `geom_facet` only supports rectangular and slanted tree layouts while `gheatmap supports rectugular, slanted and circular (Figure \@ref(fig:mgheatmap)) layouts.
In the following example, we visualized a tree of H3 influenza viruses with their associated genotype (Figure \@ref(fig:gheatmap)A).
```{r fig.width=8, fig.height=6, fig.align="center", warning=FALSE, message=FALSE, eval=F}
beast_file <- system.file("examples/MCC_FluA_H3.tree", package="ggtree")
beast_tree <- read.beast(beast_file)
genotype_file <- system.file("examples/Genotype.txt", package="ggtree")
genotype <- read.table(genotype_file, sep="\t", stringsAsFactor=F)
colnames(genotype) <- sub("\\.$", "", colnames(genotype))
p <- ggtree(beast_tree, mrsd="2013-01-01") +
geom_treescale(x=2008, y=1, offset=2) +
geom_tiplab(size=2)
gheatmap(p, genotype, offset=5, width=0.5, font.size=3,
colnames_angle=-45, hjust=0) +
scale_fill_manual(breaks=c("HuH3N2", "pdm", "trig"),
values=c("steelblue", "firebrick", "darkgreen"), name="genotype")
```
The _width_ parameter is to control the width of the heatmap. It supports another parameter _offset_ for controlling the distance between the tree and the heatmap, for instance to allocate space for tip labels.
For time-scaled tree, as in this example, it's more often to use `x` axis by using `theme_tree2`. But with this solution, the heatmap is just another layer and will change the `x` axis. To overcome this issue, we implemented `scale_x_ggtree` to set the x axis more reasonable (Figure \@ref(fig:gheatmap)A).
```{r fig.width=8, fig.height=6, fig.align="center", warning=FALSE, eval=F}
p <- ggtree(beast_tree, mrsd="2013-01-01") +
geom_tiplab(size=2, align=TRUE, linesize=.5) +
theme_tree2()
gheatmap(p, genotype, offset=8, width=0.6,
colnames=FALSE, legend_title="genotype") +
scale_x_ggtree() +
scale_y_continuous(expand=c(0, 0.3))
```
(ref:gheatmapscap) Example of plotting matrix with `gheatmap`.
(ref:gheatmapcap) **Example of plotting matrix with `gheatmap`**.
```{r gheatmap, fig.width=8, fig.height=12, warning=FALSE, message=FALSE, echo=F,fig.cap="(ref:gheatmapcap)", fig.scap="(ref:gheatmapscap)"}
beast_file <- system.file("examples/MCC_FluA_H3.tree", package="ggtree")
beast_tree <- read.beast(beast_file)
genotype_file <- system.file("examples/Genotype.txt", package="ggtree")
genotype <- read.table(genotype_file, sep="\t", stringsAsFactor=F)
colnames(genotype) <- sub("\\.$", "", colnames(genotype))
p1 <- ggtree(beast_tree, mrsd="2013-01-01") +
geom_treescale(x=2008, y=1, offset=2) +
geom_tiplab(size=2)
g1 <- gheatmap(p1, genotype, offset=5, width=0.5, font.size=3,
colnames_angle=-45, hjust=0) +
scale_fill_manual(breaks=c("HuH3N2", "pdm", "trig"),
values=c("steelblue", "firebrick", "darkgreen"), name="genotype")
p2 <- ggtree(beast_tree, mrsd="2013-01-01") +
geom_tiplab(size=2, align=TRUE, linesize=.5) +
theme_tree2()
g2 <- gheatmap(p2, genotype, offset=8, width=0.6,
colnames=FALSE, legend_title="genotype") +
scale_x_ggtree() +
scale_y_continuous(expand=c(0, 0.3))
cowplot::plot_grid(g1, g2, ncol=1, labels=c("A", "B" ))
```
### Visualize tree with multiple associated matrix {#gheatmap-ggnewscale}
Of course, we can use multiple `gheatmap` function call to align several associated matrix with the tree, however, `r CRANpkg("ggplot2")` doesn't allow us to use multiple `fill` scales^[see also discussion in <https://github.com/GuangchuangYu/ggtree/issues/78> and <https://groups.google.com/d/msg/bioc-ggtree/VQqbF79NAWU/IjIvpQOBGwAJ>].
To solve this issue, we can use [ggnewscale](https://github.com/eliocamp/ggnewscale) to create new `fill` scales. Here is an example of using [ggnewscale](https://github.com/eliocamp/ggnewscale) with `gheatmap`.
(ref:mgheatmapscap) Example of plotting multiple matrix with `gheatmap`.
(ref:mgheatmapcap) **Example of plotting multiple matrix with `gheatmap`**.
```{r mgheatmap, fig.width=10, fig.height=8, fig.cap="(ref:gheatmapcap)", fig.scap="(ref:gheatmapscap)"}
nwk <- system.file("extdata", "sample.nwk", package="treeio")
tree <- read.tree(nwk)
circ <- ggtree(tree, layout = "circular")
df <- data.frame(first=c("a", "b", "a", "c", "d", "d", "a", "b", "e", "e", "f", "c", "f"),
second= c("z", "z", "z", "z", "y", "y", "y", "y", "x", "x", "x", "a", "a"))
rownames(df) <- tree$tip.label
df2 <- as.data.frame(matrix(rnorm(39), ncol=3))
rownames(df2) <- tree$tip.label
colnames(df2) <- LETTERS[1:3]
p1 <- gheatmap(circ, df, offset=.8, width=.2,
colnames_angle=95, colnames_offset_y = .25) +
scale_fill_viridis_d(option="D", name="discrete\nvalue")
library(ggnewscale)
p2 <- p1 + new_scale_fill()
gheatmap(p2, df2, offset=15, width=.3,
colnames_angle=90, colnames_offset_y = .25) +
scale_fill_viridis_c(option="A", name="continuous\nvalue")
```
## Visualize tree with multiple sequence alignment {#msaplot}
The `msaplot` accepts a tree (output of `ggtree`) and a fasta file, then it can visualize the tree with sequence alignment. We can specify the `width` (relative to the tree) of the alignment and adjust relative position by `offset`, that are similar to `gheatmap` function.
```{r eval=F}
tree <- read.tree("data/tree.nwk")
p <- ggtree(tree) + geom_tiplab(size=3)
msaplot(p, "data/sequence.fasta", offset=3, width=2)
```
A specific slice of the alignment can also be displayed by specific `window` parameter.
```{r fig.width=7, fig.height=7, fig.align='center', warning=FALSE, eval=F}
p <- ggtree(tree, layout='circular') +
geom_tiplab(offset=4, align=TRUE) + xlim(NA, 12)
msaplot(p, "data/sequence.fasta", window=c(120, 200))
```
(ref:msaplotscap) Example of plotting multiple sequence alignment with a tree.
(ref:msaplotcap) **Example of plotting multiple sequence alignment with a tree**.
```{r msaplot, fig.width=14, fig.height=7, warning=FALSE, echo=F, fig.cap="(ref:msaplotcap)", fig.scap="(ref:msaplotscap)"}
tree <- read.tree("data/tree.nwk")
p <- ggtree(tree) + geom_tiplab(size=3) +
theme(legend.position="none")
g1 = msaplot(p, "data/sequence.fasta", offset=3, width=2)
p2 <- ggtree(tree, layout='circular') +
geom_tiplab(offset=4, align=TRUE) + xlim(NA, 12) +
theme(legend.position="none")
g2 = msaplot(p2, "data/sequence.fasta", window=c(120, 200))
cowplot::plot_grid(g1, g2, ncol=2, labels=c("A", "B"))
```
## The `ggtree` object {#ggtree_object}
<!--
## Update tree view with a new tree
In previous example, we have a _`p`_ object that stored the tree viewing of 13 tips and internal nodes highlighted with specific colored big dots. If users want to apply this pattern (we can imaging a more complex one) to a new tree, you don't need to build the tree step by step. `ggtree` provides an operator, _`%<%`_, for applying the visualization pattern to a new tree.
For example, the pattern in the _`p`_ object will be applied to a new tree with 50 tips as shown below:
```{r fig.width=3, fig.height=3, fig.align="center"}
p %<% rtree(50)
```
-->
## Summary
Although there are many software packages support visualizing phylogenetic tree, plotting tree with data is often missing or with only limited supports. Some of the packages defines `S4` classes to store phylogenetic tree with domain specific data, such as `r CRANpkg("OutbreakTools")` [@jombart_outbreaktools_2014] defined `obkData` for storing tree with epidemiology data and `r Biocpkg("phyloseq")` [@mcmurdie_phyloseq_2013] defines `phyloseq` for storing tree with microbiome data. These packages are capable to present some of the data stored in the object on the tree. However, not all the associated data are supported. For example, species abundance stored in `phyloseq` object is not supported to be visualized using `r Biocpkg("phyloseq")` package. These packages did not provide any utilities to integrate external data for tree visualization. None of these packages support visualizting external data and align the plot to tree based on the tree structure.
`r Biocpkg("ggtree")` provides general solutions for integrating data. Method 1, the `%<+%` operator, can integrate external and internal node data and map the data as visual characteristic to visualize the tree and other datasets used in `geom_facet`. Method 2, the `geom_facet` layer, has no restriction of input data as long as there is a `geom` function available to plot the data (*e.g.* species abundance displayed by `geom_density_ridges` as demonstrated in Figure \@ref(fig:phyloseq)). Users are free to combine different panels and combine different `geom` layers in the same panel (Figure \@ref(fig:jv2017)). `r Biocpkg("ggtree")` has many unique features that cannot be found in other implementations:
1. Integrating node/edge data to the tree can be mapped to visual characteristics of the tree or other datasets (Figure \ref{fig:attacher}).
2. Capable of parsing expression (math symbols or text formatting), emoji and image files ([chapter 8](#chapter8)).
3. No predefined of input data types or how the data should be plotted in `geom_facet` (Table \@ref(tab:facet-geom)).
4. Combining different `geom` functions to visualize associated data is supported (Figure \@ref(fig:jv2017)).
5. Visualizing different datasets on the same panel is supported.
6. Data integrated by `%<+%` can be used in `geom_facet`.
7. Able to add further annotation to specific layers.
8. Modular design by separating tree visualization, data integration (method 1) and graph alignment (method 2).
Modular design is a unique feature for `r Biocpkg("ggtree")` to stand out from other packages. The tree can be visualized with data stored in tree object or external data linked by `%<+%` operator, and fully annotated with multiple layers of annotations (Figure \@ref(fig:attacher) and \@ref(fig:jv2017)), before passing it to `geom_facet`. `geom_facet` can be called progressively to add multiple panels or multiple layers on the same panels (Figure \@ref(fig:jv2017)). This creates the possiblity of plotting full annotated tree with complex data panels that contains multiple graphic layers.
`r Biocpkg("ggtree")` fits the `R` ecosystem and extends the abilities of integrating and presenting data with trees to existing phylogenetic packages. As demonstrated in Figure \@ref(fig:phyloseq), we are able to plot species abundance distributions with `phyloseq` object. This cannot be easily done without `r Biocpkg("ggtree")`. With `r Biocpkg("ggtree")`, we are able to attach additional data to tree objects using `%<+%` and align graph to tree using `geom_facet`. Integrating `r Biocpkg("ggtree")` to existing workflows will definitely extends the abilities and broadens the applications to present phylogeny-associated data, especially for comparative studies.