forked from YuLab-SMU/treedata-book
-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy path05_ggtree_annotation.Rmd
387 lines (268 loc) · 20.2 KB
/
05_ggtree_annotation.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
# Phylogenetic Tree Annotation {#chapter5}
```{r include=FALSE}
library("ape")
library("ggplot2")
library("cowplot")
library("treeio")
library("ggtree")
```
## Visualizing and Annotating Tree using Grammar of Graphics
The *ggtree* is designed for more general purpose or specific type of tree visualization and annotation. It supports grammar of graphics\index{grammar of graphics} implemented in *ggplot2* and users can freely visualize/annotate a tree by combining several annotation layers.
(ref:ggtreeNHXscap) Annotating tree using grammar of graphics.
(ref:ggtreeNHXcap) **Annotating tree using grammar of graphics.** The NHX tree was annotated using grammar of graphic syntax by combining different layers using `+` operator. Species information were labelled on the middle of the branches, Duplication events were shown on most recent common ancestor and clade bootstrap value were dispalyed near to it.
```{r echo=F, message=F, warning=F}
library(ggtree)
treetext = "(((ADH2:0.1[&&NHX:S=human], ADH1:0.11[&&NHX:S=human]):
0.05 [&&NHX:S=primates:D=Y:B=100],ADHY:
0.1[&&NHX:S=nematode],ADHX:0.12 [&&NHX:S=insect]):
0.1[&&NHX:S=metazoa:D=N],(ADH4:0.09[&&NHX:S=yeast],
ADH3:0.13[&&NHX:S=yeast], ADH2:0.12[&&NHX:S=yeast],
ADH1:0.11[&&NHX:S=yeast]):0.1[&&NHX:S=Fungi])[&&NHX:D=N];"
tree <- read.nhx(textConnection(treetext))
p = ggtree(tree) + geom_tiplab() +
geom_label(aes(x=branch, label=S), fill='lightgreen') +
geom_label(aes(label=D), fill='steelblue') +
geom_text(aes(label=B), hjust=-.5)
p <- p + xlim(NA, 0.28)
```
\setstretch{1.2}
```{r echo=T, eval=F}
library(ggtree)
treetext = "(((ADH2:0.1[&&NHX:S=human], ADH1:0.11[&&NHX:S=human]):
0.05 [&&NHX:S=primates:D=Y:B=100],ADHY:
0.1[&&NHX:S=nematode],ADHX:0.12 [&&NHX:S=insect]):
0.1[&&NHX:S=metazoa:D=N],(ADH4:0.09[&&NHX:S=yeast],
ADH3:0.13[&&NHX:S=yeast], ADH2:0.12[&&NHX:S=yeast],
ADH1:0.11[&&NHX:S=yeast]):0.1[&&NHX:S=Fungi])[&&NHX:D=N];"
tree <- read.nhx(textConnection(treetext))
ggtree(tree) + geom_tiplab() +
geom_label(aes(x=branch, label=S), fill='lightgreen') +
geom_label(aes(label=D), fill='steelblue') +
geom_text(aes(label=B), hjust=-.5)
```
```{r ggtreeNHX, warning=FALSE, fig.cap="(ref:ggtreeNHXcap)", fig.scap="(ref:ggtreeNHXscap)", out.extra='', echo=F}
print(p)
```
\setstretch{1.5}
Here, as an example, we visualized the tree with several layers to display annotation stored in NHX tags, including a layer of `geom_tiplab` to display tip labels (gene name in this case), a layer using `geom_label` to show species information (`S` tag) colored by lightgreen, a layer of duplication event information (`D` tag) colored by steelblue and another layer using *geom_text* to show bootstrap value (`B` tag).
Layers defined in *ggplot2* can be applied to *ggtree* directly as demonstrated in Figure \@ref(fig:ggtreeNHX) of using *geom_label* and *geom_text*. But *ggplot2* does not provide graphic layers that are specific designed for phylogenetic tree annotation. For instance, layers for tip labels, tree branch scale legend, highlight or labeling clade are all unavailable. To make tree annotation more flexible, a number of layers have been implemented in *ggtree* (Table \@ref(tab:geoms)), enabling different ways of annotation on various parts/components of a phylogenetic tree.
```{r geoms, echo=FALSE, message=FALSE}
geoms <- matrix(c(
"geom_balance", "highlights the two direct descendant clades of an internal node",
"geom_cladelabel", "annotate a clade with bar and text label",
"geom_facet", "plot associated data in specific panel (facet) and align the plot with the tree",
"geom_hilight", "highlight a clade with rectangle",
"geom_inset", "add insets (subplots) to tree nodes",
"geom_label2", "modified version of geom_label, with subsetting supported",
"geom_nodepoint", "annotate internal nodes with symbolic points",
"geom_point2", "modified version of geom_point, with subsetting supported",
"geom_range", "bar layer to present uncertainty of evolutionary inference",
"geom_rootpoint", "annotate root node with symbolic point",
"geom_rootedge", "add root edge to a tree",
"geom_segment2", "modified version of geom_segment, with subsetting supported",
"geom_strip", "annotate associated taxa with bar and (optional) text label",
"geom_taxalink", "associate two related taxa by linking them with a curve",
"geom_text2", "modified version of geom_text, with subsetting supported",
"geom_tiplab", "layer of tip labels",
"geom_tippoint", "annotate external nodes with symbolic points",
"geom_tree", "tree structure layer, with multiple layout supported",
"geom_treescale", "tree branch scale legend"
), ncol=2, byrow=TRUE)
geoms <- as.data.frame(geoms)
colnames(geoms) <- c("Layer", "Description")
knitr::kable(geoms, caption = "Geom layers defined in ggtree.", booktabs = T) %>%
kable_styling(latex_options = c("striped", "hold_position"), full_width = T)
```
## Layers for Tree Annotation
### Colored strips
`r Biocpkg("ggtree")` [@yu_ggtree:_2017] implements _`geom_cladelabel`_ layer to annotate a selected clade with a bar indicating the clade with a corresponding label.
The _`geom_cladelabel`_ layer accepts a selected internal node number and label corresponding clade automatically (Figure \@ref(fig:cladelabel)A). To get the internal node number, please refer to [Chapter 2](#accesor-tidytree).
```{r eval=F}
set.seed(2015-12-21)
tree <- rtree(30)
p <- ggtree(tree) + xlim(NA, 6)
p + geom_cladelabel(node=45, label="test label") +
geom_cladelabel(node=34, label="another clade")
```
Users can set the parameter, `align = TRUE`, to align the clade label, `offset`, to adjust the position and color to set the color of bar and label text *etc* (Figure \@ref(fig:cladelabel)B).
```{r eval=F}
p + geom_cladelabel(node=45, label="test label", align=TRUE, offset = .2, color='red') +
geom_cladelabel(node=34, label="another clade", align=TRUE, offset = .2, color='blue')
```
Users can change the `angle` of the clade label text and relative position from text to bar via the parameter `offset.text`. The size of the bar and text can be changed via the parameters `barsize` and `fontsize` respectively (Figure \@ref(fig:cladelabel)C).
```{r eval=F}
p + geom_cladelabel(node=45, label="test label", align=T, angle=270, hjust='center', offset.text=.5, barsize=1.5) +
geom_cladelabel(node=34, label="another clade", align=T, angle=45, fontsize=8)
```
Users can also use `geom_label` to label the text and can set the background color by `fill` parameter (Figure \@ref(fig:cladelabel)D).
```{r eval=F}
p + geom_cladelabel(node=34, label="another clade", align=T, geom='label', fill='lightblue')
```
(ref:cladelabelscap) Labelling clades.
(ref:cladelabelcap) **Labeling clades.**
```{r cladelabel, echo=FALSE, fig.width=12, fig.height=7.6, fig.cap="(ref:cladelabelcap)", fig.scap="(ref:cladelabelscap)"}
set.seed(2015-12-21)
tree <- rtree(30)
p <- ggtree(tree) + xlim(NA, 8)
p1 = p + geom_cladelabel(node=45, label="test label") +
geom_cladelabel(node=34, label="another clade")
p2 = p + geom_cladelabel(node=45, label="test label", align=T, color='red') +
geom_cladelabel(node=34, label="another clade", align=T, color='blue')
p3 = p + geom_cladelabel(node=45, label="test label", align=T, angle=270,
hjust='center', offset.text=.5, barsize=1.5, fontsize=8) +
geom_cladelabel(node=34, label="another clade", align=T, angle=45)
p4 = p + geom_cladelabel(node=34, label="another clade", align=T, geom='label', fill='lightblue')
plot_grid(p1, p2, p3, p4, ncol=2, labels = LETTERS[1:4])
```
`geom_cladelabel` also supports unrooted tree layouts (Figure \@ref(fig:striplabel)A).
```{r fig.wdith=7, fig.height=7, fig.align='center', warning=FALSE, message=FALSE, eval=F}
ggtree(tree, layout="daylight") +
geom_cladelabel(node=35, label="test label", angle=0,
fontsize=8, offset=.5, vjust=.5) +
geom_cladelabel(node=55, label='another clade',
angle=-95, hjust=.5, fontsize=8)
```
`geom_cladelabel` is designed for labeling Monophyletic (Clade) while there are related taxa that are not form a clade. `ggtree` provides `geom_strip` to add a strip/bar to indicate the association with optional label for Polyphyletic or Paraphyletic (Figure \@ref(fig:striplabel)B).
```{r eval=F}
p + geom_tiplab() +
geom_strip('t10', 't30', barsize=2, color='red',
label="associated taxa", offset.text=.1) +
geom_strip('t1', 't18', barsize=2, color='blue',
label = "another label", offset.text=.1)
```
(ref:striplabelscap) Labeling associated taxa.
(ref:striplabelcap) **Labeling associated taxa.** `geom_cladelabel` is for labeling Monophyletic and it also supports unrooted layout (A). `geom_strip` is designed for labeling associated taxa (Monophyletic, Polyphyletic or Paraphyletic) (B).
```{r striplabel, fig.width=13.5, fig.height=6.5, echo=FALSE, warning=FALSE, fig.cap="(ref:striplabelcap)", fig.scap="(ref:striplabelscap)"}
pg <- ggtree(tree, layout="daylight")
p5 <- pg + geom_cladelabel(node=35, label="test label", angle=0, fontsize=8, offset=.5, vjust=.5) +
geom_cladelabel(node=55, label='another clade', angle=-95, hjust=.5, fontsize=8)
p6 <- p + geom_tiplab() +
geom_strip('t10', 't30', barsize=2, color='red',
label="associated taxa", offset.text=.1) +
geom_strip('t1', 't18', barsize=2, color='blue',
label = "another label", offset.text=.1)
plot_grid(p5, p6, ncol=2, labels=LETTERS[1:2])
```
### Highlight clades
`ggtree` implements _`geom_hilight`_ layer, that accepts an internal node number and add a layer of rectangle to highlight the selected clade Figure (\@ref(fig:hilight)) ^[If you want to plot the tree above the highlighting area, visit [FAQ](#faq-under-the-tree) for details.].
```{r eval=F, fig.width=5, fig.height=5, fig.align="center", warning=FALSE}
nwk <- system.file("extdata", "sample.nwk", package="treeio")
tree <- read.tree(nwk)
ggtree(tree) + geom_hilight(node=21, fill="steelblue", alpha=.6) +
geom_hilight(node=17, fill="darkgreen", alpha=.6)
```
```{r eval=F, fig.width=5, fig.height=5, fig.align="center", warning=FALSE}
ggtree(tree, layout="circular") + geom_hilight(node=21, fill="steelblue", alpha=.6) +
geom_hilight(node=23, fill="darkgreen", alpha=.6)
```
The `geom_hilight` layer also support highlighting clades for unrooted layout trees (Figure \@ref(fig:hilight)C).
```{r eval=FALSE, fig.width=5, fig.height=5, fig.align='center', warning=FALSE, message=FALSE}
pg + geom_hilight(node=55) + geom_hilight(node=35, fill='darkgreen')
```
Another way to highlight selected clades is setting the clades with different colors and/or line types as demonstrated in Figure \@ref(fig:scaleClade).
In addition to `geom_hilight`, `ggtree` also implements `geom_balance`
which is designed to highlight neighboring subclades of a given internal node (Figure \@ref(fig:hilight)D).
```{r fig.width=4, fig.height=5, fig.align='center', warning=FALSE, eval=F}
ggtree(tree) +
geom_balance(node=16, fill='steelblue', color='white', alpha=0.6, extend=1) +
geom_balance(node=19, fill='darkgreen', color='white', alpha=0.6, extend=1)
```
(ref:hilightscap) Highlight selected clades.
(ref:hilightcap) **Highlight selected clades.** Rectangular layout (A), circular/fan (B) and unrooted layouts. Highlight neighboring subclades simultaneously (D).
```{r hilight, echo = FALSE, fig.width=10, fig.height=10, warning=FALSE, fig.cap="(ref:hilightcap)", fig.scap="(ref:hilightscap)"}
nwk <- system.file("extdata", "sample.nwk", package="treeio")
tree <- read.tree(nwk)
p1= ggtree(tree) + geom_hilight(node=21, fill="steelblue", alpha=.6) +
geom_hilight(node=17, fill="darkgreen", alpha=.6)
p2= ggtree(tree, layout="circular") + geom_hilight(node=21, fill="steelblue", alpha=.6) +
geom_hilight(node=23, fill="darkgreen", alpha=.6)
p3 = pg + geom_hilight(node=55) + geom_hilight(node=35, fill='darkgreen')
p4 = ggtree(tree) +
geom_balance(node=16, fill='steelblue', color='white', alpha=0.6, extend=1) +
geom_balance(node=19, fill='darkgreen', color='white', alpha=0.6, extend=1)
plot_grid(p1, p2, p3, p4, ncol=2, labels=LETTERS[1:4])
```
### Taxa connection
Some evolutionary events (e.g. reassortment, horizontal gene transfer) cannot be modeled by a simple tree. `r Biocpkg("ggtree")` provides `geom_taxalink` layer that allows drawing straight or curved lines between any of two nodes in the tree, allow it to represent evolutionary events by connecting taxa.
(ref:taxalinkscap) Linking related taxa.
(ref:taxalinkcap) **Linking related taxa.** This can be used to indicate evolutionary events such as reassortment and horozontal gene transfer.
```{r taxalink, fig.cap="(ref:taxalinkcap)", fig.scap="(ref:taxalinkscap)", fig.width=5, fig.height=5, warning=FALSE}
ggtree(tree) + geom_tiplab() + geom_taxalink('A', 'E') +
geom_taxalink('F', 'K', color='red', linetype = 'dashed',
arrow=grid::arrow(length=grid::unit(0.02, "npc")))
```
### Uncertainty of evolutionary inference
The `geom_range` layer supports displaying interval (highest posterior density, confidence interval, range) as horizontal bars on tree nodes. The center of the interval will anchor to corresponding node. The center by default is the mean value of the interval (Figure \@ref(fig:geomRange)A). We can set the `center` to estimated mean or median value (Figure \@ref(fig:geomRange)B), or observed value. As the tree branch and the interval may not be in the same scale, `r Biocpkg("ggtree")` provides `scale_x_range` to add second x axis for the range (Figure \@ref(fig:geomRange)C). Note that x axis is disable by default theme and we need to enable it if we want to dispaly it (*e.g.* `theme_tree2`).
(ref:geomRangescap) Displaying uncertainty of evolutoinary inference.
(ref:geomRangecap) **Displaying uncertainty of evolutoinary inference.** The center (mean value of the range (A) or estimated value (B)) is anchor to the tree nodes. A second x axis was used for range scaling (C).
```{ eval=FALSE}
file <- system.file("extdata/MEGA7", "mtCDNA_timetree.nex", package = "treeio")
x <- read.mega(file)
p1 <- ggtree(x) + geom_range('reltime_0.95_CI', color='red', size=3, alpha=.3)
p2 <- ggtree(x) + geom_range('reltime_0.95_CI', color='red', size=3, alpha=.3, center='reltime')
p3 <- p2 + scale_x_range() + theme_tree2()
```
```{r geomRange, fig.cap="(ref:geomRangecap)", fig.scap="(ref:geomRangescap)", fig.width=12, fig.height=4, echo=F}
file <- system.file("extdata/MEGA7", "mtCDNA_timetree.nex", package = "treeio")
x <- read.mega(file)
p1 <- ggtree(x) + geom_range('reltime_0.95_CI', color='red', size=3, alpha=.3)
p2 <- ggtree(x) + geom_range('reltime_0.95_CI', color='red', size=3, alpha=.3, center='reltime') + coord_cartesian(ylim = c(1, 7))
p3 <- p2 + scale_x_range() + theme_tree2()
plot_grid(p1, p2, p3, ncol=3, labels = LETTERS[1:3])
```
## Tree annotation with output from evolution software
### Tree annotation using data from evolutionary analysis software
[Chapter 1](#chapter) introduced using `r Biocpkg("treeio")` packages to parse different tree formats and commonly used software outputs to obtain phylogeny-associated data. These imported data as `S4` objects can be visualized directly using `r Biocpkg("ggtree")`. Figure \@ref(fig:ggtreeNHX) demonstrates a tree annotated using the information (species classification, duplication event and bootstrap value) stored in NHX\index{NHX} file. *PHYLDOG* and *RevBayes* output NHX files that can be parsed by `r Biocpkg("treeio")` and visualized by `r Biocpkg("ggtree")` with annotation using their inference data.
Furthermore, the evolutionary data from the inference of *BEAST*, *MrBayes* and *RevBayes*, *d~N~/d~S~* values inferred by *CodeML*, ancestral sequences\index{ancestral sequences} inferred by *HyPhy*, *CodeML* or *BaseML* and short read placement by *EPA* and *pplacer* can be used to annotate the tree directly.
(ref:beastscap) Annotating *BEAST* tree with _length\_95%\_HPD_ and posterior.
(ref:beastcap) **Annotating *BEAST* tree with _length\_95%\_HPD_ and posterior.** Branch length credible intervals (95% HPD) were displayed as red horizontal bars and clade posterior values were shown on the middle of branches.
\setstretch{1.2}
```{r beast, fig.cap="(ref:beastcap)", fig.scap="(ref:beastscap)", fig.width=7, out.extra=''}
file <- system.file("extdata/BEAST", "beast_mcc.tree", package="treeio")
beast <- read.beast(file)
ggtree(beast, aes(color=rate)) +
geom_range(range='length_0.95_HPD', color='red', alpha=.6, size=2) +
geom_nodelab(aes(x=branch, label=round(posterior, 2)), vjust=-.5, size=3) +
scale_color_continuous(low="darkgreen", high="red") +
theme(legend.position=c(.1, .8))
```
\setstretch{1.5}
In Figure \@ref(fig:beast), the tree was visualized and annotated with posterior > 0.9 and demonstrated length uncertainty (95% Highest Posterior Density (HPD) interval).
Ancestral sequences inferred by *HyPhy* can be parsed using [*treeio*](#parsing-hyphy-output), whereas the substitutions along each tree branch was automatically computed and stored inside the phylogenetic tree object (*i.e.*, `S4` object). The *ggtree* can utilize this information in the object to annotate the tree, as demonstrated in Figure \@ref(fig:hyphy).
(ref:hyphyscap) Annotating tree with amino acid substitution determined by ancestral sequences inferred by HYPHY.
(ref:hyphycap) **Annotating tree with amino acid substitution determined by ancestral sequences inferred by HYPHY.** Amino acid substitutions were displayed on the middle of branches.
\setstretch{1.2}
```{r hyphy, fig.width=7.8, fig.height=3.5, warning=FALSE, fig.cap="(ref:hyphycap)", fig.scap="(ref:hyphyscap)", out.extra=''}
nwk <- system.file("extdata/HYPHY", "labelledtree.tree",
package="treeio")
ancseq <- system.file("extdata/HYPHY", "ancseq.nex",
package="treeio")
tipfas <- system.file("extdata", "pa.fas", package="treeio")
hy <- read.hyphy(nwk, ancseq, tipfas)
ggtree(hy) +
geom_text(aes(x=branch, label=AA_subs), size=2,
vjust=-.3, color="firebrick")
```
\setstretch{1.5}
*PAML*'s *BaseML* and *CodeML* can be also used to infer ancestral sequences, whereas *CodeML*\index{CodeML} can infer selection pressure. After parsing this information using [*treeio*](#parsing-paml-output), *ggtree* can integrate this information into the same tree structure and used for annotation as illustrated in Figure \@ref(fig:codeml).
(ref:codemlscap) Annotating tree with animo acid substitution and *d~N~/d~S~* inferred by *CodeML*.
(ref:codemlcap) **Annotating tree with animo acid substitution and *d~N~/d~S~* inferred by *CodeML*.** Branches were rescaled and colored by *d~N~/d~S~* values and amino acid substitutions were displayed on the middle of branches.
\setstretch{1.2}
```{r codeml, fig.cap="(ref:codemlcap)", fig.scap="(ref:codemlscap)", warning=FALSE, out.extra='', fig.height=4}
rstfile <- system.file("extdata/PAML_Codeml", "rst",
package="treeio")
mlcfile <- system.file("extdata/PAML_Codeml", "mlc",
package="treeio")
ml <- read.codeml(rstfile, mlcfile)
ggtree(ml, aes(color=dN_vs_dS), branch.length='dN_vs_dS') +
scale_color_continuous(name='dN/dS', limits=c(0, 1.5),
oob=scales::squish,
low='darkgreen', high='red') +
geom_text(aes(x=branch, label=AA_subs),
vjust=-.5, color='steelblue', size=2) +
theme_tree2(legend.position=c(.9, .3))
```
\setstretch{1.5}
Not only all the tree data that parsed by `r Biocpkg("treeio")` can be used to visualize and annotate phylogenetic tree using `ggtree`, but also other tree and tree-like objects defined in R community are supported. The `r Biocpkg("ggtree")` plays an unique role in R ecosystem to facilitate phylogenetic analysis and it can be easily integrated into other packages and pipelines. For more details, please refer to [chapter 9](#chapter9). In addition to direct support of tree objects, `r Biocpkg("ggtree")` also allow users to plot tree with different types of external data (see also [chapter 7](#chapter7) and [@yu_two_2018]).
## Summary
`r Biocpkg("ggtree")` implements grammar of graphics for annotating phylogenetic trees. Users can use ggplot2 syntax to combine different annotation layers to produce complex tree annotation. If you are familiar with `r CRANpkg("ggplot2")`, tree annotation with high level of customization can be intuitive and flexible using `r Biocpkg("ggtree")`.