scRNAseqWorkflow.Rmd

---
title: "scRNAseqWorkflow"
output:
  pdf_document:
    
  md_document:
    variant: markdown_github
---

# Brendan's skeleton scRNAseq workflow using scran, Seurat, and scClustViz

This is an RStudio notebook that reflects my opinion of best practices in single-sample processing of scRNAseq data from the 10X Genomics platform.  It is heavily based on the [SimpleSingleCell](https://bioconductor.org/packages/release/workflows/vignettes/simpleSingleCell/inst/doc/tenx.html) and [Seurat](https://satijalab.org/seurat/v3.1/pbmc3k_tutorial.html) tutorials.  Normalization is performed using *scran*, with *Seurat* for clustering.  Clustering is performed iteratively at higher resolutions and stopped when differential expression between clusters is lost, as assessed by [scClustViz](https://baderlab.github.io/scClustViz/) using the wilcoxon rank-sum test.

At the start of every code block there will be variables to edit to modify the output of that block.  I encourage users to run each block individual, assess the output, and modify as needed.  scRNAseq analysis is not plug-and-play.


```{r package_installation, eval=FALSE, include=TRUE}
# This code block won't run, but shows the commands to install the required packages

install.packages(c("Seurat","BiocManager","devtools","Matrix"))
BiocManager::install(c("DropletUtils","scater","scran","AnnotationDbi",
                       "EnsDb.Mmusculus.v79","EnsDb.Hsapiens.v86",
                       "org.Mm.eg.db","org.Hs.eg.db"))
devtools::install_github("immunogenomics/presto")
devtools::install_github("BaderLab/scClustViz")

```

Please make sure to set `SOURCE_SPECIES` appropriately!

```{r setup}
SOURCE_SPECIES <- "mouse" # "human" if the data is from human!

if (SOURCE_SPECIES == "mouse") {
  library(EnsDb.Mmusculus.v79) 
  ENSDB <- "EnsDb.Mmusculus.v79"
  library(org.Mm.eg.db) #library(org.Hs.eg.db) if human
  EGDB <- "org.Mm.eg.db"
} else if (SOURCE_SPECIES == "human") {
  library(EnsDb.Hsapiens.v86)
  ENSDB <- "EnsDb.Hsapiens.v86"
  library(org.Hs.eg.db)
  EGDB <- "org.Hs.eg.db"
} else {
  stop('You must set SOURCE_SPECIES to either "mouse" or "human" at the start of this code block!')
}
library(AnnotationDbi)
library(Matrix)
library(DropletUtils)
library(scater)
library(scran)
library(Seurat)
library(scClustViz)
library(colorspace)
```

## Read in data

Do *one* of the following things, then skip down to "Filter Cells".


### Read data from a count matrix:  

Don't do this blindly!  You need to check your input file to ensure it has column names (`header=T`), rownames in the first column (`row.names=1`), and is separated by tabs (`sep="\t"`).  Otherwise you'll have to adjust the arguments to suit the file.

```{r read_data_from_counts}
temp_counts <- read.table("count_matrix_text_file.txt",
                          sep="\t",header=T,row.names=1,
                          as.is=T,quote="\"")
sce <- SingleCellExperiment(assays=list(
  counts=Matrix(as.matrix(temp_counts),sparse=T)
))
if (!all(c("ID","Symbol") %in% colnames(rowData(sce)))) {
  temp_keytype <- findKeyType(rownames(sce),get(ENSDB))
  if (temp_keytype != "GENEID") {
    rowData(sce)$ID <- mapIds(get(ENSDB),
                              keys=rownames(counts(sce)),
                              column="GENEID",
                              keytype=temp_keytype)
  } else {
    rowData(sce)$ID <- rownames(sce)
  }
  if (temp_keytype != "GENENAME") {
    rowData(sce)$Symbol <- mapIds(get(ENSDB),
                                  keys=rownames(counts(sce)),
                                  column="GENENAME",
                                  keytype=temp_keytype)
  } else {
    rowData(sce)$Symbol <- rownames(sce)
  }
}
show(sce)

# cleaning up
rm(list=grep("^temp",ls(),value=T))
```


### Single dataset:  

10X Genomics Cell Ranger v3 uses a much better heuristic for determining empty droplets, so its generally safe to go straight to using the filtered matrix.

Run this code block to merge a single dataset:

```{r read_in_data}
input_from_10x <- "filtered_feature_bc_matrix"

sce <- read10xCounts(input_from_10x,col.names=T)
sce <- sce[rowSums(counts(sce)) > 0,]
show(sce)
```


### Merging multiple datasets:

For subsets of cells from multiple datasets, define your subsets here (in the same order you'll load the datasets):

```{r barcodes_to_keep}
temp_barcodes_to_keep <- list(cluster_10Day_barcodes,
                              cluster_14Day_barcodes,
                              cluster_L11Day_barcodes)
rm(list=temp)
```

Then run this code block to load and merge multiple datasets.  If no subsetting is needed, just skip the previous block. The vector `temp_data_file_paths` should be set to the folder names of each raw dataset. They can be relative or full filepaths. Name the folder containing the data something meaningful, as it will be used as the dataset label (i.e. rename "mm10" to something useful).

```{r merging_multiples_manually}
temp_data_file_paths <- c("D:/Dropbox/GDB/MKlab_v1to2/For Brendan/10Day/",
                          "D:/Dropbox/GDB/MKlab_v1to2/For Brendan/14Day/",
                          "D:/Dropbox/GDB/MKlab_v1to2/For Brendan/L11Day/")
# The folder / dataset names are cleaned to 
# remove anything that isn't a letter or number.
names(temp_data_file_paths) <- gsub("[^A-Za-z0-9]","",
                                    sapply(strsplit(temp_data_file_paths,"/"),
                                           function(X) X[length(X)]))

# Reading raw data, creating a list of SCE objects.
temp_all_sce <- sapply(temp_data_file_paths,
                       read10xCounts,
                       col.names=T,
                       simplify=F)

# Subsetting each SCE object
if (exists("temp_barcodes_to_keep")) {
  temp_all_sce <- mapply(function(SCE,BC) SCE[,BC],
                         SCE=temp_all_sce,
                         BC=temp_barcodes_to_keep)
}

# Collecting all gene names across datasets
temp_all_genes <- sort(unique(unlist(lapply(temp_all_sce,rownames))))

# Making unified rowData
temp_rowdata <- do.call("rbind",
                        lapply(temp_all_sce,function(Y) 
                          rowData(Y)[,Reduce("intersect",
                                             lapply(temp_all_sce,
                                                    function(X) colnames(rowData(X))))]))
temp_rowdata <- temp_rowdata[!duplicated(rownames(temp_rowdata)),]
temp_rowdata <- temp_rowdata[temp_all_genes,]

# Adding missing genes to each count matrix and 
# generating replacement SCEs with missing genes included
for (i in names(temp_all_sce)) {
  temp_missing_genes <- temp_all_genes[!temp_all_genes %in% rownames(temp_all_sce[[i]])]
  temp_new_counts <- rbind(
    counts(temp_all_sce[[i]]),
    Matrix(0,nrow=length(temp_missing_genes),ncol=ncol(temp_all_sce[[i]]),
           dimnames=list(temp_missing_genes,colnames(temp_all_sce[[i]])))
  )
  temp_new_counts <- temp_new_counts[temp_all_genes,]
  colnames(temp_new_counts) <- paste(i,colnames(temp_new_counts),sep="_")
  temp_sce <- SingleCellExperiment(assays=list(counts=temp_new_counts))
  rowData(temp_sce) <- temp_rowdata
  colData(temp_sce) <- colData(temp_all_sce[[i]])[,Reduce("intersect",
                                                          lapply(temp_all_sce,function(X) 
                                                            colnames(colData(X))))]
  colData(temp_sce)$orig.ident <- i
  rownames(colData(temp_sce)) <- colnames(temp_new_counts)
  temp_all_sce[[i]] <- temp_sce
}
sce <- do.call("cbind",temp_all_sce)

sce <- sce[rowSums(counts(sce)) > 0,]
show(sce)

# cleaning up
rm(list=c("i",grep("^temp",ls(),value=T)))
```


## Filter cells

```{r cell_qc}
rowData(sce)$CHR <- mapIds(get(ENSDB),
                           keys=rowData(sce)$ID, 
                           column="SEQNAME",
                           keytype="GENEID") 

temp_coldata <- perCellQCMetrics(sce,percent_top=NA,flatten=T,
                                 subsets=list(Mito=which(rowData(sce)$CHR=="MT")))
colData(sce) <- cbind(colData(sce),temp_coldata)

temp_rowdata <- perFeatureQCMetrics(sce,flatten=T)
rowData(sce) <- cbind(rowData(sce),temp_rowdata)

# cleaning up
rm(list=grep("^temp",ls(),value=T))
```

Filtering cells based on the proportion of mitochondrial gene transcripts per cell.  A high proportion of mitochondrial gene transcripts are indicative of poor quality cells, probably due to compromised cell membranes. Keiran Campbell (now at LTRI) has cast some doubt on the validity of using mitochondrial proportion as a determinant of cell health in [this paper](https://doi.org/10.1186/s13059-019-1830-0), so try not to be too aggressive with this filtering.  _You can increase the `mads_thresh` to reduce the number of cells filtered._

```{r filter_mito, fig.height=4, fig.width=8}
mads_thresh <- 4
hard_thresh <- 50

mito_thresh <- median(sce$subsets_Mito_percent) + mad(sce$subsets_Mito_percent) * mads_thresh
drop_mito <- sce$subsets_Mito_percent > mito_thresh | sce$subsets_Mito_percent > hard_thresh

par(mar=c(3,3,2,1),mgp=2:0)
hist(sce$subsets_Mito_percent,breaks=50,xlab="% mitochondrial mRNA")
abline(v=mito_thresh,col="red",lwd=2)
mtext(paste(paste0(mads_thresh," MADs over median: "),
            paste0(round(mito_thresh,2),"% mitochondrial mRNA"),
            paste0(sum(drop_mito)," cells removed"),
            sep="\n"),
      side=3,line=-3,at=mito_thresh,adj=-0.05)

temp_col <- sequential_hcl(100,palette="Viridis",alpha=0.5,rev=T)
par(mfrow=c(1,2),mar=c(3,3,2,1),mgp=2:0)
plot(sce$sum,sce$detected,log="xy",pch=20,
     xlab="Sum of transcript counts per cell",
     ylab="Number of genes detected per cell",
     col=temp_col[cut(c(0,1,sce$subsets_Mito_percent),100,labels=F)[c(-1,-2)]])
legend("topleft",bty="n",title="Mito %",
       legend=c(0,50,100),pch=20,col=temp_col[c(1,50,100)])
plot(sce$sum,sce$detected,log="xy",pch=20,
     xlab="Sum of transcript counts per cell",
     ylab="Number of genes detected per cell",
     col=temp_col[cut(c(0,1,sce$subsets_Mito_percent),100,labels=F)[c(-1,-2)]])
points(sce$sum[drop_mito],sce$detected[drop_mito],
       pch=4,col="red")
legend("topleft",bty="n",pch=4,col="red",
       title=paste("Mito % >",round(mito_thresh,2)),
       legend=paste(sum(drop_mito),"cells"))
```

If there aren't any outlier cells that need filtering, its ok to *not* filter any cells by simply skipping the following code block or setting `remove_high_mito` to `FALSE`.

```{r apply_filter_mito}
remove_high_mito <- TRUE
if (remove_high_mito) {
  sce <- sce[,!drop_mito]
}
show(sce)
```

It is important to manually inspect the relationship between library size and gene detection rates per cell to identify obvious outliers.  In this case, we've identified a population of cells with a different relationship between library size and complexity, as well as one cell with a clearly outlying library size.  _You can select your outlier cells by changing `filt_intercept` and `filt_slope` to define a line in library size vs. detection rate space._

```{r filter_outlier,fig.height=4, fig.width=8}
filt_intercept <- 100
filt_slope <- .055
to_inspect <- sce$detected < (sce$sum * filt_slope + filt_intercept)

temp_col <- sequential_hcl(100,palette="Viridis",alpha=0.5,rev=T)
par(mfrow=c(1,2),mar=c(3,3,2,1),mgp=2:0)
plot(sce$sum,sce$detected,log="",pch=20,
     xlab="Sum of transcript counts per cell",
     ylab="Number of genes detected per cell",
     main="Select outliers to inspect",
     col=temp_col[cut(c(0,1,sce$subsets_Mito_percent),100,labels=F)[c(-1,-2)]])
legend("topleft",bty="n",title="Mito %",
       legend=c(0,50,100),pch=20,col=temp_col[c(1,50,100)])
abline(filt_intercept,filt_slope,lwd=2,col="red")


plot(sce$sum,sce$detected,log="xy",pch=20,
     xlab="Sum of transcript counts per cell",
     ylab="Number of genes detected per cell",
     main="Select outliers to inspect",
     col=temp_col[cut(c(0,1,sce$subsets_Mito_percent),100,labels=F)[c(-1,-2)]])
points(sce$sum[to_inspect],sce$detected[to_inspect],pch=1,col="red")
legend("topleft",bty="n",pch=1,col="red",legend="Outliers")

```

By comparing the transcriptomes of the outlier cells to the remaining cells, we see that they're likely erythrocytes and can be removed.

```{r inspect_outliers,fig.height=4, fig.width=8}
if (sum(to_inspect) > 1) {
  out_DR <- RowNNZ(counts(sce)[,to_inspect,drop=F])/ sum(to_inspect)
  out_MDGE <- pbapply::pbapply(counts(sce)[,to_inspect],1,function(X) mean(X[X > 0]))
  
  par(mfrow=c(1,2),mar=c(3,3,2,1),mgp=2:0)
  plot(mean~detected,data=rowData(sce),
       pch=".",cex=2,log="y",main="Gene expression in all cells",
       xlab="Detection Rate",ylab="Mean Detected Count")
  points(mean~detected,
         data=rowData(sce)[grep("^Hb[ab]",rowData(sce)$Symbol),],
         pch=20,col="red")
  plot(out_DR,out_MDGE,pch=".",cex=2,log="y",
       xlab="Detection Rate",ylab="Mean Detected Count",
       main="Gene expression in outliers")
  points(out_DR[grep("^Hb[ab]",rowData(sce)$Symbol)],
         out_MDGE[grep("^Hb[ab]",rowData(sce)$Symbol)],
         pch=20,col="red")
  legend("topleft",pch=20,col="red",legend="Haemoglobin")
}
```

If you decide to remove these outliers, set `remove_outliers` to `TRUE`.

```{r apply_filter_outliers}
remove_outliers <- TRUE

if (remove_outliers) {
  sce <- sce[,!to_inspect]
}
show(sce)
```


## Cell cycle prediction with cyclone
Cyclone generates individual scores for each cell cycle phase.  G1 and G2/M are assigned based on these scores, and any cells not strongly scoring for either phase are assigned to S phase. 

```{r cyclone}
cycloneSpeciesMarkers <- switch(SOURCE_SPECIES,
                                mouse="mouse_cycle_markers.rds",
                                human="human_cycle_markers.rds")

cycScores <- cyclone(sce,gene.names=rowData(sce)$ID,
                     pairs=readRDS(system.file("exdata",cycloneSpeciesMarkers,package="scran")))
cycScores$phases <- as.factor(cycScores$phases)
cycScores$confidence <- sapply(seq_along(cycScores$phases),function(i) {
  if (is.na(cycScores$phases[i])) {
    return(0)
  } else {
    return(cycScores$normalized.scores[i,as.character(cycScores$phases[i])])
  }
})
for (l in names(cycScores)) {
  if (is.null(dim(cycScores[[l]]))) {
    names(cycScores[[l]]) <- colnames(sce)
  } else {
    rownames(cycScores[[l]]) <- colnames(sce)
  }
}
colData(sce)$CyclonePhase <- cycScores$phases
colData(sce)$CycloneConfidence <- cycScores$confidence
```

```{r fig.height=4, fig.width=8}
layout(matrix(c(1,2,1,3,1,4),2),widths=c(2,5,1),heights=c(1,9))
par(mar=rep(0,4),mgp=2:0)
plot.new()
title("Cell cycle phase assignment confidence, library sizes, and number of cells assigned",line=-2,cex.main=1.5)

par(mar=c(3,3,1,1),bty="n")
boxplot(tapply(cycScores$confidence,cycScores$phases,c),
        col=qualitative_hcl(3,alpha=.7,palette="Dark 3"),
        ylab="Normalized score of assigned cell cycle phase")

par(mar=c(3,3,1,1))
cycDlibSize <- tapply(log10(colData(sce)$sum),cycScores$phases,function(X) density(X))
plot(x=NULL,y=NULL,ylab="Density",xlab=expression(Log[10]~"Library Size"),
     xlim=range(log10(colData(sce)$sum)),
     ylim=c(min(sapply(cycDlibSize,function(X) min(X$y))),
            max(sapply(cycDlibSize,function(X) max(X$y)))))
for (x in 1:length(cycDlibSize)) {
  lines(cycDlibSize[[x]],lwd=3,
        col=qualitative_hcl(3,alpha=.7,palette="Dark 3")[x])
}
legend("topleft",bty="n",horiz=T,lwd=rep(3,3),legend=levels(cycScores$phases),
       col=qualitative_hcl(3,alpha=.7,palette="Dark 3"))

par(mar=c(3,3,1,1))
if (any(is.na(cycScores$phases))) {
  barplot(cbind(c(table(cycScores$phases),sum(is.na(cycScores$phases)))),
          col=c(qualitative_hcl(3,alpha=.7,palette="Dark 3"),"grey50"),
          ylab="Number of cells")
  barplot(cbind(c(table(cycScores$phases),sum(is.na(cycScores$phases)))),
          col=c(rep(NA,3),"red"),density=c(rep(0,3),20),add=T,
          ylab="Number of cells")
  # mtext("Not assigned",side=2,las=1,line=2,
  #       at=sum(!is.na(cycScores$phases)) + sum(is.na(cycScores$phases)) / 2,
  #       cex=0.7,col="darkred",font=2)
  # mtext(expression("" %->% ""),side=2,las=1,line=0,
  #       at=sum(!is.na(cycScores$phases)) + sum(is.na(cycScores$phases)) / 2,
  #       cex=0.7,col="darkred",font=2)
  mtext(expression("Not assigned" %down% ""),
        at=.85,adj=1,cex=0.7,col="darkred",font=2)
} else {
  barplot(cbind(table(cycScores$phases)),
          col=qualitative_hcl(3,alpha=.7,palette="Dark 3"),
          ylab="Number of cells")
}
```

Cells not assigned to any phase probably lack expression of a sufficient number of cell cycle genes.  This may be due to a low number of genes detected in those cells (perhaps they are erythrocytes?). It should be possible to filter these cells out in the outlier-filtering step earlier.


## Filter genes
Remove genes detected in 3 or fewer cells, to prevent errors in normalization.
```{r filter_genes}
sce <- sce[rowSums(counts(sce)) >= 3,]
rownames(sce) <- uniquifyFeatureNames(rowData(sce)$ID, rowData(sce)$Symbol)
# Assign gene symbols as rownames. 
# You might have to change this depending on your rownames
show(sce)
```


## Normalization
Next step is normalization.  Marioni proposed a normalization technique that attempts to generate cell-specific size factors that are robust to differential expression between genes in a heterogenous sample, unlike simple library-size normalization (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0947-7).  This method correlates strongly with library size normalization for homogenous samples, but solves a series of linear equations to deconvolute cell-specific size factors for normalization.  In order to better handle heterogenous data, they suggest separating the data by simple heirarchical clustering of a Spearman correlation-based distance metric so that they can normalize the separate subpopulations separately to prevent the suppression of true differential expression during normalization.  

Normalization is carried out by assigning size factors per gene by the pooling and deconvolution method, then taking the log-ratio between each count and its size factor, and adding a pseudocount of one.  Log-transforming the data stabilizes variance by reducing the impact of a few highly variable genes.  

Check that clusters aren't too correlated with library size.

```{r sum_factors, fig.height=4, fig.width=8}
temp_qcl <- quickCluster(sce,use.ranks=F,method="igraph")
sce <- computeSumFactors(sce,cluster=temp_qcl)

par(mfrow=c(1,2),mar=c(3,3,2,1),mgp=2:0)
plot(sce$sum,sizeFactors(sce),log="xy",pch=20,main="Coloured by quickCluster",
     col=qualitative_hcl(length(levels(temp_qcl)),alpha=.7,palette="Dark 3")[temp_qcl])
# legend("topleft",bty="n",horiz=F,legend=levels(temp_qcl),title="Cluster",
#        pch=20,col=qualitative_hcl(length(levels(temp_qcl)),alpha=.7,palette="Dark 3"))
plot(sce$sum,sizeFactors(sce),log="",pch=20,main="Coloured by quickCluster",
     col=qualitative_hcl(length(levels(temp_qcl)),alpha=.7,palette="Dark 3")[temp_qcl])
# legend("topleft",bty="n",horiz=F,legend=levels(temp_qcl),title="Cluster",
#        pch=20,col=qualitative_hcl(length(levels(temp_qcl)),alpha=.7,palette="Dark 3"))
```

```{r normalization}
sce <- logNormCounts(sce)
```


## Highly-variable genes
```{r HVGs, fig.width=6,fig.height=4}
dec <- modelGeneVar(sce)
par(mar=c(3,3,2,1),mgp=2:0)
plot(dec$mean,dec$total,pch=20,
     main="Mean-Variance relationship",
     xlab="Mean log-expression",ylab="Variance")
curve(metadata(dec)$trend(x), col="blue", add=TRUE)

top2kHVGs <- getTopHVGs(dec,n=2000)
```

## To Seurat!

Seurat might warn that it's replacing all dashes in rownames with underscores (ie. mt-Cytb becomes mt_Cytb).  _This is normal._  
**Check to ensure that the metadata columns you expect to be present made it over from the SCE object.**  Rename metadata columns if necessary to improve legibility.  `sum` used to be `total_counts` and refers to number of transcripts detected per cell. `detected` used to be `total_features` and refers to the number of genes with at least one transcript detected per cell.  `subsets_Mito_*` are the same, but only referring to mitochondrial genes, with `subsets_Mito_percent` being the percent of transcripts per cell from mitochondrial genes.  `total` is the same as `sum`.

```{r}
seur <- as.Seurat(sce)
seur@meta.data <- seur@meta.data[,!colnames(seur@meta.data) %in% c("Sample","Barcode")]
# ^cleaning up metadata
```

Seurat cell cycle prediction uses sets of HGNC symbols to score each cell cycle phase.  If you have mouse data, it will freak out and show a big long warning about how it can't find the genes, then do a case-insensitive match and be just fine. _You can safely ignore this warning._
```{r scale_and_cell_cycle}
seur <- ScaleData(seur,verbose=F)
seur <- CellCycleScoring(seur,
                         g2m.features=cc.genes$g2m.genes,
                         s.features=cc.genes$s.genes)
```

```{r pca}
seur <- RunPCA(seur,features=top2kHVGs,verbose=F)
ElbowPlot(seur,ndims=50)
```

Select the number of principle components to use in downstream analysis, and _set `n_pc` accordingly_.

```{r tsne, fig.height=5,fig.width=5}
n_pc <- 20

seur <- RunTSNE(seur,dims=1:n_pc,reduction="pca",perplexity=30)
plot_tsne(cell_coord=getEmb(seur,"tsne"),
          md=getMD(seur)$sum,
          md_title="sum",
          md_log=T)
if ("orig.ident" %in% colnames(getMD(seur))) {
  plot_tsne(cell_coord=getEmb(seur,"tsne"),
          md=getMD(seur)$orig.ident,
          md_title="orig.ident",
          md_log=T)
}
```
You can add plots for other metadata features of interest (ie. batch) here by _copying the plot code and setting the `md=` argument to a different column of your metadata_.

Playing with the `perplexity` parameter in `RunTSNE` can improve the visualization.  Perplexity can be interpretted as the number of nearby cells to consider when trying to minimize distance between neighbouring cells.

```{r umap, fig.height=5,fig.width=5}
seur <- RunUMAP(seur,dims=1:n_pc,reduction="pca",verbose=F)
plot_tsne(cell_coord=getEmb(seur,"umap"),
          md=getMD(seur)$sum,
          md_title="sum",
          md_log=T)
if ("orig.ident" %in% colnames(getMD(seur))) {
  plot_tsne(cell_coord=getEmb(seur,"umap"),
          md=getMD(seur)$orig.ident,
          md_title="orig.ident",
          md_log=T)
}
```


## Iterative clustering with scClustViz
Seurat implements an interpretation of SNN-Cliq (https://doi.org/10.1093/bioinformatics/btv088) for clustering of single-cell expression data.  They use PCs to define the distance metric, then embed the cells in a graph where edges between cells (nodes) are weighted based on their similarity (euclidean distance in PCA space).  These edge weights are refined based on Jaccard distance (overlap in local neighbourhoods), and then communities ("quasi-cliques") are identified in the graph using a smart local moving algorithm (SLM, http://dx.doi.org/10.1088/1742-5468/2008/10/P10008) to optimize the modularity measure of the defined communities in the graph.  

This code block iterates through "resolutions" of the Seurat clustering method, testing each for overfitting. Overfitting is determined by testing differential expression between all pairs of clusters using a wilcoxon rank-sum test.  If there are no significantly differentially expressed genes between nearest neighbouring clusters, iterative clustering is stopped.  The output is saved as an sCVdata object for use in scClustViz.

```{r clustering, results="hold"}
max_seurat_resolution <- 0.6 # For the sake of the demo, quit early.
## ^ change this to something large (5?) to ensure iterations stop eventually.
output_filename <- "./for_scClustViz.RData"
FDRthresh <- 0.01 # FDR threshold for statistical tests
min_num_DE <- 1
seurat_resolution <- 0 # Starting resolution is this plus the jump value below.
seurat_resolution_jump <- 0.2

seur <- FindNeighbors(seur,dims=1:n_pc,verbose=F)

sCVdata_list <- list()
DE_bw_clust <- TRUE
while(DE_bw_clust) {
  if (seurat_resolution >= max_seurat_resolution) { break }
  seurat_resolution <- seurat_resolution + seurat_resolution_jump 
  # ^ iteratively incrementing resolution parameter 
  
  seur <- FindClusters(seur,resolution=seurat_resolution,verbose=F)
  
  message(" ")
  message("------------------------------------------------------")
  message(paste0("--------  res.",seurat_resolution," with ",
                 length(levels(Idents(seur)))," clusters --------"))
  message("------------------------------------------------------")
  
  if (length(levels(Idents(seur))) <= 1) { 
    message("Only one cluster found, skipping analysis.")
    next 
  } 
  # ^ Only one cluster was found, need to bump up the resolution!
  
  if (length(sCVdata_list) >= 1) {
    temp_cl <- length(levels(Clusters(sCVdata_list[[length(sCVdata_list)]])))
    if (temp_cl == length(levels(Idents(seur)))) { 
      temp_cli <- length(levels(interaction(
        Clusters(sCVdata_list[[length(sCVdata_list)]]),
        Idents(seur),
        drop=T
      )))
      if (temp_cli == length(levels(Idents(seur)))) { 
        message("Clusters unchanged from previous, skipping analysis.")
        next 
      }
    }
  }
  
  curr_sCVdata <- CalcSCV(
    inD=seur,
    assayType="RNA",
    cl=Idents(seur), 
    # ^ your most recent clustering results get stored in the Seurat "ident" slot
    exponent=2, 
    # ^ since you used scran for normalization, data is in log2 space.
    pseudocount=1,
    DRthresh=0.1,
    DRforClust="pca",
    calcSil=T,
    calcDEvsRest=T,
    calcDEcombn=T
  )
  
  DE_bw_NN <- sapply(DEneighb(curr_sCVdata,FDRthresh),nrow)
  # ^ counts # of DE genes between neighbouring clusters at your selected FDR threshold
  message(paste("Number of DE genes between nearest neighbours:",min(DE_bw_NN)))
  
  if (min(DE_bw_NN) < min_num_DE) { DE_bw_clust <- FALSE }
  # ^ If no DE genes between nearest neighbours, don't loop again.
  
  sCVdata_list[[paste0("res.",seurat_resolution)]] <- curr_sCVdata
}

seur@meta.data <- seur@meta.data[,colnames(seur@meta.data) != "seurat_clusters"]
# cleaning redundant metadata

seur <- DietSeurat(seur,dimreducs=Reductions(seur))
# ^ shrinks the size of the Seurat object by removing the scaled matrix

save(sCVdata_list,seur,file=output_filename)
```

View the scClustViz report by running this code chunk.  Note that the `cellMarkers` argument should be changed to reflect marker genes for your expected clusters.  Or better yet, use a better cluster annotation method (such as [cellassign](https://shahlab.ca/projects/cellassign/), [SingleR](https://bioconductor.org/packages/release/bioc/html/SingleR.html), or [Capybara](https://github.com/morris-lab/Capybara)), then _name the clusters manually as outlined [here](https://baderlab.github.io/scClustViz/#use-your-own-cluster-names)_.

```{r scClustViz, eval=FALSE, include=TRUE}
runShiny(output_filename,
         #change CellMarkers to suit your needs, or remove it and 
         # use ClusterNames to name your clusters (see web for usage).
         cellMarkers=list( 
           "Cortical precursors"=c("Mki67","Sox2","Pax6","Pcna",
                                   "Nes","Cux1","Cux2"), 
           "Interneurons"=c("Gad1","Gad2","Npy","Sst","Lhx6",
                            "Tubb3","Rbfox3","Dcx"), 
           "Cajal-Retzius neurons"="Reln", 
           "Intermediate progenitors"="Eomes", 
           "Projection neurons"=c("Tbr1","Satb2","Fezf2","Bcl11b","Tle4","Nes",
                                  "Cux1","Cux2","Tubb3","Rbfox3","Dcx")
         ),
         annotationDB="org.Mm.eg.db" #"org.Hs.eg.db" for human
)
```