-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathscRNAseqWorkflow.Rmd
617 lines (504 loc) · 27 KB
/
scRNAseqWorkflow.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
---
title: "scRNAseqWorkflow"
output:
pdf_document:
md_document:
variant: markdown_github
---
# Brendan's skeleton scRNAseq workflow using scran, Seurat, and scClustViz
This is an RStudio notebook that reflects my opinion of best practices in single-sample processing of scRNAseq data from the 10X Genomics platform. It is heavily based on the [SimpleSingleCell](https://bioconductor.org/packages/release/workflows/vignettes/simpleSingleCell/inst/doc/tenx.html) and [Seurat](https://satijalab.org/seurat/v3.1/pbmc3k_tutorial.html) tutorials. Normalization is performed using *scran*, with *Seurat* for clustering. Clustering is performed iteratively at higher resolutions and stopped when differential expression between clusters is lost, as assessed by [scClustViz](https://baderlab.github.io/scClustViz/) using the wilcoxon rank-sum test.
At the start of every code block there will be variables to edit to modify the output of that block. I encourage users to run each block individual, assess the output, and modify as needed. scRNAseq analysis is not plug-and-play.
```{r package_installation, eval=FALSE, include=TRUE}
# This code block won't run, but shows the commands to install the required packages
install.packages(c("Seurat","BiocManager","devtools","Matrix"))
BiocManager::install(c("DropletUtils","scater","scran","AnnotationDbi",
"EnsDb.Mmusculus.v79","EnsDb.Hsapiens.v86",
"org.Mm.eg.db","org.Hs.eg.db"))
devtools::install_github("immunogenomics/presto")
devtools::install_github("BaderLab/scClustViz")
```
Please make sure to set `SOURCE_SPECIES` appropriately!
```{r setup}
SOURCE_SPECIES <- "mouse" # "human" if the data is from human!
if (SOURCE_SPECIES == "mouse") {
library(EnsDb.Mmusculus.v79)
ENSDB <- "EnsDb.Mmusculus.v79"
library(org.Mm.eg.db) #library(org.Hs.eg.db) if human
EGDB <- "org.Mm.eg.db"
} else if (SOURCE_SPECIES == "human") {
library(EnsDb.Hsapiens.v86)
ENSDB <- "EnsDb.Hsapiens.v86"
library(org.Hs.eg.db)
EGDB <- "org.Hs.eg.db"
} else {
stop('You must set SOURCE_SPECIES to either "mouse" or "human" at the start of this code block!')
}
library(AnnotationDbi)
library(Matrix)
library(DropletUtils)
library(scater)
library(scran)
library(Seurat)
library(scClustViz)
library(colorspace)
```
## Read in data
Do *one* of the following things, then skip down to "Filter Cells".
### Read data from a count matrix:
Don't do this blindly! You need to check your input file to ensure it has column names (`header=T`), rownames in the first column (`row.names=1`), and is separated by tabs (`sep="\t"`). Otherwise you'll have to adjust the arguments to suit the file.
```{r read_data_from_counts}
temp_counts <- read.table("count_matrix_text_file.txt",
sep="\t",header=T,row.names=1,
as.is=T,quote="\"")
sce <- SingleCellExperiment(assays=list(
counts=Matrix(as.matrix(temp_counts),sparse=T)
))
if (!all(c("ID","Symbol") %in% colnames(rowData(sce)))) {
temp_keytype <- findKeyType(rownames(sce),get(ENSDB))
if (temp_keytype != "GENEID") {
rowData(sce)$ID <- mapIds(get(ENSDB),
keys=rownames(counts(sce)),
column="GENEID",
keytype=temp_keytype)
} else {
rowData(sce)$ID <- rownames(sce)
}
if (temp_keytype != "GENENAME") {
rowData(sce)$Symbol <- mapIds(get(ENSDB),
keys=rownames(counts(sce)),
column="GENENAME",
keytype=temp_keytype)
} else {
rowData(sce)$Symbol <- rownames(sce)
}
}
show(sce)
# cleaning up
rm(list=grep("^temp",ls(),value=T))
```
### Single dataset:
10X Genomics Cell Ranger v3 uses a much better heuristic for determining empty droplets, so its generally safe to go straight to using the filtered matrix.
Run this code block to merge a single dataset:
```{r read_in_data}
input_from_10x <- "filtered_feature_bc_matrix"
sce <- read10xCounts(input_from_10x,col.names=T)
sce <- sce[rowSums(counts(sce)) > 0,]
show(sce)
```
### Merging multiple datasets:
For subsets of cells from multiple datasets, define your subsets here (in the same order you'll load the datasets):
```{r barcodes_to_keep}
temp_barcodes_to_keep <- list(cluster_10Day_barcodes,
cluster_14Day_barcodes,
cluster_L11Day_barcodes)
rm(list=temp)
```
Then run this code block to load and merge multiple datasets. If no subsetting is needed, just skip the previous block. The vector `temp_data_file_paths` should be set to the folder names of each raw dataset. They can be relative or full filepaths. Name the folder containing the data something meaningful, as it will be used as the dataset label (i.e. rename "mm10" to something useful).
```{r merging_multiples_manually}
temp_data_file_paths <- c("D:/Dropbox/GDB/MKlab_v1to2/For Brendan/10Day/",
"D:/Dropbox/GDB/MKlab_v1to2/For Brendan/14Day/",
"D:/Dropbox/GDB/MKlab_v1to2/For Brendan/L11Day/")
# The folder / dataset names are cleaned to
# remove anything that isn't a letter or number.
names(temp_data_file_paths) <- gsub("[^A-Za-z0-9]","",
sapply(strsplit(temp_data_file_paths,"/"),
function(X) X[length(X)]))
# Reading raw data, creating a list of SCE objects.
temp_all_sce <- sapply(temp_data_file_paths,
read10xCounts,
col.names=T,
simplify=F)
# Subsetting each SCE object
if (exists("temp_barcodes_to_keep")) {
temp_all_sce <- mapply(function(SCE,BC) SCE[,BC],
SCE=temp_all_sce,
BC=temp_barcodes_to_keep)
}
# Collecting all gene names across datasets
temp_all_genes <- sort(unique(unlist(lapply(temp_all_sce,rownames))))
# Making unified rowData
temp_rowdata <- do.call("rbind",
lapply(temp_all_sce,function(Y)
rowData(Y)[,Reduce("intersect",
lapply(temp_all_sce,
function(X) colnames(rowData(X))))]))
temp_rowdata <- temp_rowdata[!duplicated(rownames(temp_rowdata)),]
temp_rowdata <- temp_rowdata[temp_all_genes,]
# Adding missing genes to each count matrix and
# generating replacement SCEs with missing genes included
for (i in names(temp_all_sce)) {
temp_missing_genes <- temp_all_genes[!temp_all_genes %in% rownames(temp_all_sce[[i]])]
temp_new_counts <- rbind(
counts(temp_all_sce[[i]]),
Matrix(0,nrow=length(temp_missing_genes),ncol=ncol(temp_all_sce[[i]]),
dimnames=list(temp_missing_genes,colnames(temp_all_sce[[i]])))
)
temp_new_counts <- temp_new_counts[temp_all_genes,]
colnames(temp_new_counts) <- paste(i,colnames(temp_new_counts),sep="_")
temp_sce <- SingleCellExperiment(assays=list(counts=temp_new_counts))
rowData(temp_sce) <- temp_rowdata
colData(temp_sce) <- colData(temp_all_sce[[i]])[,Reduce("intersect",
lapply(temp_all_sce,function(X)
colnames(colData(X))))]
colData(temp_sce)$orig.ident <- i
rownames(colData(temp_sce)) <- colnames(temp_new_counts)
temp_all_sce[[i]] <- temp_sce
}
sce <- do.call("cbind",temp_all_sce)
sce <- sce[rowSums(counts(sce)) > 0,]
show(sce)
# cleaning up
rm(list=c("i",grep("^temp",ls(),value=T)))
```
## Filter cells
```{r cell_qc}
rowData(sce)$CHR <- mapIds(get(ENSDB),
keys=rowData(sce)$ID,
column="SEQNAME",
keytype="GENEID")
temp_coldata <- perCellQCMetrics(sce,percent_top=NA,flatten=T,
subsets=list(Mito=which(rowData(sce)$CHR=="MT")))
colData(sce) <- cbind(colData(sce),temp_coldata)
temp_rowdata <- perFeatureQCMetrics(sce,flatten=T)
rowData(sce) <- cbind(rowData(sce),temp_rowdata)
# cleaning up
rm(list=grep("^temp",ls(),value=T))
```
Filtering cells based on the proportion of mitochondrial gene transcripts per cell. A high proportion of mitochondrial gene transcripts are indicative of poor quality cells, probably due to compromised cell membranes. Keiran Campbell (now at LTRI) has cast some doubt on the validity of using mitochondrial proportion as a determinant of cell health in [this paper](https://doi.org/10.1186/s13059-019-1830-0), so try not to be too aggressive with this filtering. _You can increase the `mads_thresh` to reduce the number of cells filtered._
```{r filter_mito, fig.height=4, fig.width=8}
mads_thresh <- 4
hard_thresh <- 50
mito_thresh <- median(sce$subsets_Mito_percent) + mad(sce$subsets_Mito_percent) * mads_thresh
drop_mito <- sce$subsets_Mito_percent > mito_thresh | sce$subsets_Mito_percent > hard_thresh
par(mar=c(3,3,2,1),mgp=2:0)
hist(sce$subsets_Mito_percent,breaks=50,xlab="% mitochondrial mRNA")
abline(v=mito_thresh,col="red",lwd=2)
mtext(paste(paste0(mads_thresh," MADs over median: "),
paste0(round(mito_thresh,2),"% mitochondrial mRNA"),
paste0(sum(drop_mito)," cells removed"),
sep="\n"),
side=3,line=-3,at=mito_thresh,adj=-0.05)
temp_col <- sequential_hcl(100,palette="Viridis",alpha=0.5,rev=T)
par(mfrow=c(1,2),mar=c(3,3,2,1),mgp=2:0)
plot(sce$sum,sce$detected,log="xy",pch=20,
xlab="Sum of transcript counts per cell",
ylab="Number of genes detected per cell",
col=temp_col[cut(c(0,1,sce$subsets_Mito_percent),100,labels=F)[c(-1,-2)]])
legend("topleft",bty="n",title="Mito %",
legend=c(0,50,100),pch=20,col=temp_col[c(1,50,100)])
plot(sce$sum,sce$detected,log="xy",pch=20,
xlab="Sum of transcript counts per cell",
ylab="Number of genes detected per cell",
col=temp_col[cut(c(0,1,sce$subsets_Mito_percent),100,labels=F)[c(-1,-2)]])
points(sce$sum[drop_mito],sce$detected[drop_mito],
pch=4,col="red")
legend("topleft",bty="n",pch=4,col="red",
title=paste("Mito % >",round(mito_thresh,2)),
legend=paste(sum(drop_mito),"cells"))
```
If there aren't any outlier cells that need filtering, its ok to *not* filter any cells by simply skipping the following code block or setting `remove_high_mito` to `FALSE`.
```{r apply_filter_mito}
remove_high_mito <- TRUE
if (remove_high_mito) {
sce <- sce[,!drop_mito]
}
show(sce)
```
It is important to manually inspect the relationship between library size and gene detection rates per cell to identify obvious outliers. In this case, we've identified a population of cells with a different relationship between library size and complexity, as well as one cell with a clearly outlying library size. _You can select your outlier cells by changing `filt_intercept` and `filt_slope` to define a line in library size vs. detection rate space._
```{r filter_outlier,fig.height=4, fig.width=8}
filt_intercept <- 100
filt_slope <- .055
to_inspect <- sce$detected < (sce$sum * filt_slope + filt_intercept)
temp_col <- sequential_hcl(100,palette="Viridis",alpha=0.5,rev=T)
par(mfrow=c(1,2),mar=c(3,3,2,1),mgp=2:0)
plot(sce$sum,sce$detected,log="",pch=20,
xlab="Sum of transcript counts per cell",
ylab="Number of genes detected per cell",
main="Select outliers to inspect",
col=temp_col[cut(c(0,1,sce$subsets_Mito_percent),100,labels=F)[c(-1,-2)]])
legend("topleft",bty="n",title="Mito %",
legend=c(0,50,100),pch=20,col=temp_col[c(1,50,100)])
abline(filt_intercept,filt_slope,lwd=2,col="red")
plot(sce$sum,sce$detected,log="xy",pch=20,
xlab="Sum of transcript counts per cell",
ylab="Number of genes detected per cell",
main="Select outliers to inspect",
col=temp_col[cut(c(0,1,sce$subsets_Mito_percent),100,labels=F)[c(-1,-2)]])
points(sce$sum[to_inspect],sce$detected[to_inspect],pch=1,col="red")
legend("topleft",bty="n",pch=1,col="red",legend="Outliers")
```
By comparing the transcriptomes of the outlier cells to the remaining cells, we see that they're likely erythrocytes and can be removed.
```{r inspect_outliers,fig.height=4, fig.width=8}
if (sum(to_inspect) > 1) {
out_DR <- RowNNZ(counts(sce)[,to_inspect,drop=F])/ sum(to_inspect)
out_MDGE <- pbapply::pbapply(counts(sce)[,to_inspect],1,function(X) mean(X[X > 0]))
par(mfrow=c(1,2),mar=c(3,3,2,1),mgp=2:0)
plot(mean~detected,data=rowData(sce),
pch=".",cex=2,log="y",main="Gene expression in all cells",
xlab="Detection Rate",ylab="Mean Detected Count")
points(mean~detected,
data=rowData(sce)[grep("^Hb[ab]",rowData(sce)$Symbol),],
pch=20,col="red")
plot(out_DR,out_MDGE,pch=".",cex=2,log="y",
xlab="Detection Rate",ylab="Mean Detected Count",
main="Gene expression in outliers")
points(out_DR[grep("^Hb[ab]",rowData(sce)$Symbol)],
out_MDGE[grep("^Hb[ab]",rowData(sce)$Symbol)],
pch=20,col="red")
legend("topleft",pch=20,col="red",legend="Haemoglobin")
}
```
If you decide to remove these outliers, set `remove_outliers` to `TRUE`.
```{r apply_filter_outliers}
remove_outliers <- TRUE
if (remove_outliers) {
sce <- sce[,!to_inspect]
}
show(sce)
```
## Cell cycle prediction with cyclone
Cyclone generates individual scores for each cell cycle phase. G1 and G2/M are assigned based on these scores, and any cells not strongly scoring for either phase are assigned to S phase.
```{r cyclone}
cycloneSpeciesMarkers <- switch(SOURCE_SPECIES,
mouse="mouse_cycle_markers.rds",
human="human_cycle_markers.rds")
cycScores <- cyclone(sce,gene.names=rowData(sce)$ID,
pairs=readRDS(system.file("exdata",cycloneSpeciesMarkers,package="scran")))
cycScores$phases <- as.factor(cycScores$phases)
cycScores$confidence <- sapply(seq_along(cycScores$phases),function(i) {
if (is.na(cycScores$phases[i])) {
return(0)
} else {
return(cycScores$normalized.scores[i,as.character(cycScores$phases[i])])
}
})
for (l in names(cycScores)) {
if (is.null(dim(cycScores[[l]]))) {
names(cycScores[[l]]) <- colnames(sce)
} else {
rownames(cycScores[[l]]) <- colnames(sce)
}
}
colData(sce)$CyclonePhase <- cycScores$phases
colData(sce)$CycloneConfidence <- cycScores$confidence
```
```{r fig.height=4, fig.width=8}
layout(matrix(c(1,2,1,3,1,4),2),widths=c(2,5,1),heights=c(1,9))
par(mar=rep(0,4),mgp=2:0)
plot.new()
title("Cell cycle phase assignment confidence, library sizes, and number of cells assigned",line=-2,cex.main=1.5)
par(mar=c(3,3,1,1),bty="n")
boxplot(tapply(cycScores$confidence,cycScores$phases,c),
col=qualitative_hcl(3,alpha=.7,palette="Dark 3"),
ylab="Normalized score of assigned cell cycle phase")
par(mar=c(3,3,1,1))
cycDlibSize <- tapply(log10(colData(sce)$sum),cycScores$phases,function(X) density(X))
plot(x=NULL,y=NULL,ylab="Density",xlab=expression(Log[10]~"Library Size"),
xlim=range(log10(colData(sce)$sum)),
ylim=c(min(sapply(cycDlibSize,function(X) min(X$y))),
max(sapply(cycDlibSize,function(X) max(X$y)))))
for (x in 1:length(cycDlibSize)) {
lines(cycDlibSize[[x]],lwd=3,
col=qualitative_hcl(3,alpha=.7,palette="Dark 3")[x])
}
legend("topleft",bty="n",horiz=T,lwd=rep(3,3),legend=levels(cycScores$phases),
col=qualitative_hcl(3,alpha=.7,palette="Dark 3"))
par(mar=c(3,3,1,1))
if (any(is.na(cycScores$phases))) {
barplot(cbind(c(table(cycScores$phases),sum(is.na(cycScores$phases)))),
col=c(qualitative_hcl(3,alpha=.7,palette="Dark 3"),"grey50"),
ylab="Number of cells")
barplot(cbind(c(table(cycScores$phases),sum(is.na(cycScores$phases)))),
col=c(rep(NA,3),"red"),density=c(rep(0,3),20),add=T,
ylab="Number of cells")
# mtext("Not assigned",side=2,las=1,line=2,
# at=sum(!is.na(cycScores$phases)) + sum(is.na(cycScores$phases)) / 2,
# cex=0.7,col="darkred",font=2)
# mtext(expression("" %->% ""),side=2,las=1,line=0,
# at=sum(!is.na(cycScores$phases)) + sum(is.na(cycScores$phases)) / 2,
# cex=0.7,col="darkred",font=2)
mtext(expression("Not assigned" %down% ""),
at=.85,adj=1,cex=0.7,col="darkred",font=2)
} else {
barplot(cbind(table(cycScores$phases)),
col=qualitative_hcl(3,alpha=.7,palette="Dark 3"),
ylab="Number of cells")
}
```
Cells not assigned to any phase probably lack expression of a sufficient number of cell cycle genes. This may be due to a low number of genes detected in those cells (perhaps they are erythrocytes?). It should be possible to filter these cells out in the outlier-filtering step earlier.
## Filter genes
Remove genes detected in 3 or fewer cells, to prevent errors in normalization.
```{r filter_genes}
sce <- sce[rowSums(counts(sce)) >= 3,]
rownames(sce) <- uniquifyFeatureNames(rowData(sce)$ID, rowData(sce)$Symbol)
# Assign gene symbols as rownames.
# You might have to change this depending on your rownames
show(sce)
```
## Normalization
Next step is normalization. Marioni proposed a normalization technique that attempts to generate cell-specific size factors that are robust to differential expression between genes in a heterogenous sample, unlike simple library-size normalization (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0947-7). This method correlates strongly with library size normalization for homogenous samples, but solves a series of linear equations to deconvolute cell-specific size factors for normalization. In order to better handle heterogenous data, they suggest separating the data by simple heirarchical clustering of a Spearman correlation-based distance metric so that they can normalize the separate subpopulations separately to prevent the suppression of true differential expression during normalization.
Normalization is carried out by assigning size factors per gene by the pooling and deconvolution method, then taking the log-ratio between each count and its size factor, and adding a pseudocount of one. Log-transforming the data stabilizes variance by reducing the impact of a few highly variable genes.
Check that clusters aren't too correlated with library size.
```{r sum_factors, fig.height=4, fig.width=8}
temp_qcl <- quickCluster(sce,use.ranks=F,method="igraph")
sce <- computeSumFactors(sce,cluster=temp_qcl)
par(mfrow=c(1,2),mar=c(3,3,2,1),mgp=2:0)
plot(sce$sum,sizeFactors(sce),log="xy",pch=20,main="Coloured by quickCluster",
col=qualitative_hcl(length(levels(temp_qcl)),alpha=.7,palette="Dark 3")[temp_qcl])
# legend("topleft",bty="n",horiz=F,legend=levels(temp_qcl),title="Cluster",
# pch=20,col=qualitative_hcl(length(levels(temp_qcl)),alpha=.7,palette="Dark 3"))
plot(sce$sum,sizeFactors(sce),log="",pch=20,main="Coloured by quickCluster",
col=qualitative_hcl(length(levels(temp_qcl)),alpha=.7,palette="Dark 3")[temp_qcl])
# legend("topleft",bty="n",horiz=F,legend=levels(temp_qcl),title="Cluster",
# pch=20,col=qualitative_hcl(length(levels(temp_qcl)),alpha=.7,palette="Dark 3"))
```
```{r normalization}
sce <- logNormCounts(sce)
```
## Highly-variable genes
```{r HVGs, fig.width=6,fig.height=4}
dec <- modelGeneVar(sce)
par(mar=c(3,3,2,1),mgp=2:0)
plot(dec$mean,dec$total,pch=20,
main="Mean-Variance relationship",
xlab="Mean log-expression",ylab="Variance")
curve(metadata(dec)$trend(x), col="blue", add=TRUE)
top2kHVGs <- getTopHVGs(dec,n=2000)
```
## To Seurat!
Seurat might warn that it's replacing all dashes in rownames with underscores (ie. mt-Cytb becomes mt_Cytb). _This is normal._
**Check to ensure that the metadata columns you expect to be present made it over from the SCE object.** Rename metadata columns if necessary to improve legibility. `sum` used to be `total_counts` and refers to number of transcripts detected per cell. `detected` used to be `total_features` and refers to the number of genes with at least one transcript detected per cell. `subsets_Mito_*` are the same, but only referring to mitochondrial genes, with `subsets_Mito_percent` being the percent of transcripts per cell from mitochondrial genes. `total` is the same as `sum`.
```{r}
seur <- as.Seurat(sce)
# ^cleaning up metadata
```
Seurat cell cycle prediction uses sets of HGNC symbols to score each cell cycle phase. If you have mouse data, it will freak out and show a big long warning about how it can't find the genes, then do a case-insensitive match and be just fine. _You can safely ignore this warning._
```{r scale_and_cell_cycle}
seur <- ScaleData(seur,verbose=F)
seur <- CellCycleScoring(seur,
g2m.features=cc.genes$g2m.genes,
s.features=cc.genes$s.genes)
```
```{r pca}
seur <- RunPCA(seur,features=top2kHVGs,verbose=F)
ElbowPlot(seur,ndims=50)
```
Select the number of principle components to use in downstream analysis, and _set `n_pc` accordingly_.
```{r tsne, fig.height=5,fig.width=5}
n_pc <- 20
seur <- RunTSNE(seur,dims=1:n_pc,reduction="pca",perplexity=30)
plot_tsne(cell_coord=getEmb(seur,"tsne"),
md=getMD(seur)$sum,
md_title="sum",
md_log=T)
if ("orig.ident" %in% colnames(getMD(seur))) {
plot_tsne(cell_coord=getEmb(seur,"tsne"),
md=getMD(seur)$orig.ident,
md_title="orig.ident",
md_log=T)
}
```
You can add plots for other metadata features of interest (ie. batch) here by _copying the plot code and setting the `md=` argument to a different column of your metadata_.
Playing with the `perplexity` parameter in `RunTSNE` can improve the visualization. Perplexity can be interpretted as the number of nearby cells to consider when trying to minimize distance between neighbouring cells.
```{r umap, fig.height=5,fig.width=5}
seur <- RunUMAP(seur,dims=1:n_pc,reduction="pca",verbose=F)
plot_tsne(cell_coord=getEmb(seur,"umap"),
md=getMD(seur)$sum,
md_title="sum",
md_log=T)
if ("orig.ident" %in% colnames(getMD(seur))) {
plot_tsne(cell_coord=getEmb(seur,"umap"),
md=getMD(seur)$orig.ident,
md_title="orig.ident",
md_log=T)
}
```
## Iterative clustering with scClustViz
Seurat implements an interpretation of SNN-Cliq (https://doi.org/10.1093/bioinformatics/btv088) for clustering of single-cell expression data. They use PCs to define the distance metric, then embed the cells in a graph where edges between cells (nodes) are weighted based on their similarity (euclidean distance in PCA space). These edge weights are refined based on Jaccard distance (overlap in local neighbourhoods), and then communities ("quasi-cliques") are identified in the graph using a smart local moving algorithm (SLM, http://dx.doi.org/10.1088/1742-5468/2008/10/P10008) to optimize the modularity measure of the defined communities in the graph.
This code block iterates through "resolutions" of the Seurat clustering method, testing each for overfitting. Overfitting is determined by testing differential expression between all pairs of clusters using a wilcoxon rank-sum test. If there are no significantly differentially expressed genes between nearest neighbouring clusters, iterative clustering is stopped. The output is saved as an sCVdata object for use in scClustViz.
```{r clustering, results="hold"}
max_seurat_resolution <- 0.6 # For the sake of the demo, quit early.
## ^ change this to something large (5?) to ensure iterations stop eventually.
output_filename <- "./for_scClustViz.RData"
FDRthresh <- 0.01 # FDR threshold for statistical tests
min_num_DE <- 1
seurat_resolution <- 0 # Starting resolution is this plus the jump value below.
seurat_resolution_jump <- 0.2
seur <- FindNeighbors(seur,dims=1:n_pc,verbose=F)
sCVdata_list <- list()
DE_bw_clust <- TRUE
while(DE_bw_clust) {
if (seurat_resolution >= max_seurat_resolution) { break }
seurat_resolution <- seurat_resolution + seurat_resolution_jump
# ^ iteratively incrementing resolution parameter
seur <- FindClusters(seur,resolution=seurat_resolution,verbose=F)
message(" ")
message("------------------------------------------------------")
message(paste0("-------- res.",seurat_resolution," with ",
length(levels(Idents(seur)))," clusters --------"))
message("------------------------------------------------------")
if (length(levels(Idents(seur))) <= 1) {
message("Only one cluster found, skipping analysis.")
next
}
# ^ Only one cluster was found, need to bump up the resolution!
if (length(sCVdata_list) >= 1) {
temp_cl <- length(levels(Clusters(sCVdata_list[[length(sCVdata_list)]])))
if (temp_cl == length(levels(Idents(seur)))) {
temp_cli <- length(levels(interaction(
Clusters(sCVdata_list[[length(sCVdata_list)]]),
Idents(seur),
drop=T
)))
if (temp_cli == length(levels(Idents(seur)))) {
message("Clusters unchanged from previous, skipping analysis.")
next
}
}
}
curr_sCVdata <- CalcSCV(
inD=seur,
assayType="RNA",
cl=Idents(seur),
# ^ your most recent clustering results get stored in the Seurat "ident" slot
exponent=2,
# ^ since you used scran for normalization, data is in log2 space.
pseudocount=1,
DRthresh=0.1,
DRforClust="pca",
calcSil=T,
calcDEvsRest=T,
calcDEcombn=T
)
DE_bw_NN <- sapply(DEneighb(curr_sCVdata,FDRthresh),nrow)
# ^ counts # of DE genes between neighbouring clusters at your selected FDR threshold
message(paste("Number of DE genes between nearest neighbours:",min(DE_bw_NN)))
if (min(DE_bw_NN) < min_num_DE) { DE_bw_clust <- FALSE }
# ^ If no DE genes between nearest neighbours, don't loop again.
sCVdata_list[[paste0("res.",seurat_resolution)]] <- curr_sCVdata
}
# cleaning redundant metadata
seur <- DietSeurat(seur,dimreducs=Reductions(seur))
# ^ shrinks the size of the Seurat object by removing the scaled matrix
save(sCVdata_list,seur,file=output_filename)
```
View the scClustViz report by running this code chunk. Note that the `cellMarkers` argument should be changed to reflect marker genes for your expected clusters. Or better yet, use a better cluster annotation method (such as [cellassign](https://shahlab.ca/projects/cellassign/), [SingleR](https://bioconductor.org/packages/release/bioc/html/SingleR.html), or [Capybara](https://github.com/morris-lab/Capybara)), then _name the clusters manually as outlined [here](https://baderlab.github.io/scClustViz/#use-your-own-cluster-names)_.
```{r scClustViz, eval=FALSE, include=TRUE}
runShiny(output_filename,
#change CellMarkers to suit your needs, or remove it and
# use ClusterNames to name your clusters (see web for usage).
cellMarkers=list(
"Cortical precursors"=c("Mki67","Sox2","Pax6","Pcna",
"Nes","Cux1","Cux2"),
"Interneurons"=c("Gad1","Gad2","Npy","Sst","Lhx6",
"Tubb3","Rbfox3","Dcx"),
"Cajal-Retzius neurons"="Reln",
"Intermediate progenitors"="Eomes",
"Projection neurons"=c("Tbr1","Satb2","Fezf2","Bcl11b","Tle4","Nes",
"Cux1","Cux2","Tubb3","Rbfox3","Dcx")
),
annotationDB="org.Mm.eg.db" #"org.Hs.eg.db" for human
)
```