diff --git a/vignettes/single_cell_rnaseq_basic.Rmd b/vignettes/single_cell_rnaseq_basic.Rmd deleted file mode 100644 index 9bc10b1..0000000 --- a/vignettes/single_cell_rnaseq_basic.Rmd +++ /dev/null @@ -1,329 +0,0 @@ ---- -title: "Analysis of single-cell RNA-seq data using a topic model, Part 1: basic concepts" -author: Peter Carbonetto -date: "`r Sys.Date()`" -output: - html_document: - toc: yes - toc_float: no - highlight: textmate - theme: readable -vignette: > - %\VignetteIndexEntry{Analysis of single-cell RNA-seq data using a topic model, Part 1: basic concepts} - %\VignetteEngine{knitr::rmarkdown} - %\VignetteEncoding{UTF-8} ---- - -The aim of this vignette is to introduce the basic concepts behind an -analysis of single-cell RNA-seq data using a topic model, and to show -how to use 'fastTopics' to implement this analysis. This first -vignette explains the analysis steps at only a high level---see -[Part 2][vignette-part-2] for additional explanations and guidance. - -Since a topic model analysis is quite different from most conventional -analyses of single-cell RNA-seq data, we point out key -differences. - -One important difference is that a topic model is a model of count -data, so *the topic model should be applied directly to the count -data.* In contrast, many methods require preprocessing of the count -data. - -```{r knitr-opts, include=FALSE} -knitr::opts_chunk$set(comment = "#",collapse = TRUE,results = "hold", - fig.align = "center",dpi = 120) -``` - -We begin our analysis by loading the packages. Then we set the seed so -that the results can be reproduced. - -```{r load-pkgs, message=FALSE, warning=FALSE} -library(Matrix) -library(fastTopics) -library(ggplot2) -library(cowplot) -set.seed(1) -``` - -The example data set --------------------- - -We will illustrate the concepts using a single-cell RNA-seq data set -from [Zheng *et al* (2017)][zheng-2017]. These data are reference -transcriptome profiles from 10 bead-enriched subpopulations of -peripheral blood mononuclear cells (PBMCs). The original data set is -much larger---for this introduction, we have taken a small subset of -approximately 3,700 cells. - -The data we will analyze are unique molecular identifier (UMI) -counts. These data are stored as an $n \times m$ sparse matrix, where -$n$ is the number of cells and $m$ is the number of genes: - -```{r load-data} -data(pbmc_facs) -counts <- pbmc_facs$counts -dim(counts) -``` - -The UMI counts are "sparse"---that is, most of the counts are -zero. Indeed, over 95% of the UMI counts are zero: - -```{r nonzeros} -mean(counts > 0) -``` - -No need to normalize or transform ---------------------------------- - -Most analyses of single-cell RNA-seq data involve a preprocessing step -in which the UMI counts are log-transformed and normalized. *When -analyzing UMI counts with a topic model, you should not normalize or -transform the counts.* - -Additionally, analyses of single-cell RNA-seq data typically select -only the most highly variable genes. *We recommend instead using all -or most genes.* The exception are genes in which all the UMI counts -are zero---these genes should be removed prior to a topic model -analysis. - -Fit the topic model -------------------- - -Since no pre-processing is needed, we can move directly to the next -step of the analysis: fitting the topic model to the UMI count -data. This is accomplished with a call to `fit_topic_model`: - -```{r fit-topic-model, eval=FALSE} -fit <- fit_topic_model(counts,k = 6) -``` - -Note it may take several minutes to fit the topic model on this data -set. For convenience, we saved the output from this call: - -```{r load-fit} -fit <- pbmc_facs$fit -``` - -To fit a topic model, we must specify $K$, the number of topics. -Here, we have chosen $K = 6$ topics. In most settings, a good choice -of $K$ will not be known in advance, so you will you want to explore -the results from topic models at different settings of $K$. - -The `fit_topic_model` interface is intended to hide the details of -model fitting, and it should work well for many data sets. Larger or -more complex data sets my require some fine-tuning of the model -fitting. See [Part 2][vignette-part-2] for more on this. - -Each cell is represented as a unique combination of topics ----------------------------------------------------------- - -A key feature of the topic model is that each cell $i$ is represented -as a *unique combination of the topics.* Therefore, each cell can be -summarized by the $K$ topic proportions. These cell-specific -proportions are learned from the data. In 'fastTopics', the topic -proportions for all cells are stored in an $n \times K$ matrix: - -```{r loadings-1} -dim(fit$L) -``` - -To illustrate, here is a cell in which the pattern of expression is -almost fully captured by the fourth topic: - -```{r loadings-2} -rows <- "GATATATGTCAGTG-1-b_cells" -round(fit$L[rows,],digits = 3) -``` - -Here are two more examples: - -```{r loadings-3} -rows <- c("GACAGTACCTGTGA-1-memory_t", - "TGAAGCACACAGCT-1-b_cells") -round(fit$L[rows,],digits = 3) -``` - -The last example is interesting because the observed expression is -best captured by a combination of topics 4 and 6. - -To make sense of these results, we first need to understand the -biological relevance of the topics, which we explore next. - -Interpreting topics using available cell labels ------------------------------------------------ - -In some cases, you may have additional information about the cells, -such as the tissue the cells were sampled from. (For interpreting -topics when cells are not labeled, see [Part 2][vignette-part-2].) In -the PBMC data, the cells were labeled using fluorescence-activated -cell sorting (FACS). Each cell is assigned one of five labels (each -corresponding to a cell type): B cells, CD14+ monocytes, CD34+ cells, -natural killer (NK) cells, and T cells. - -```{r summary-subpop} -samples <- pbmc_facs$samples -summary(samples$subpop) -``` - -Create a "Structure plot" to visualize the relationship between the -cell labels and the topic proportions: - -```{r structure-plot-with-celltype-labels, fig.width=7.5, fig.height=1.75, results="hide", message=FALSE} -topic_colors <- c("skyblue","forestgreen","darkmagenta","dodgerblue", - "gold","darkorange") -structure_plot(fit,colors = topic_colors,topics = 1:6,gap = 25, - grouping = samples$subpop) -``` - -The Structure plot is a stacked bar chart in which each topic is -represented as a bar of a different colour, and the bar heights are -the topic proportions. Patterns begin to emerge when the cells are -arranged so that cells with similar topic proportions are positioned -close to each other. This arrangement is automated in `structure_plot`. - -From the Structure plot, we see that topics 1 (light blue), 2 (green) -and 4 (dark blue) closely correspond to NK, CD14+ and B cell -types, respectively; expression in these cells is largely -explained by a single topic. - -The topic model also captures interesting substructure within the T -cells: most T cells are explained by at least two topics, with wide -variation in the proportions for these two topics; also, there is a -subset of T cells that are explained by three topics (topics 1, 5 and -6). - -Predicting topics in other cells --------------------------------- - -Having fitted a topic model to a single-cell data set, we can use that -model to predict topics for cells that were not used to train the -model. We illustrate this with a separate collection of 100 cells -drawn from the same PBMC data set: - -```{r test-data} -counts_test <- pbmc_facs$counts_test -dim(counts_test) -``` - -The "predict" function estimates the topic proportions based on a -previously fit topic model. - -```{r predict, results="hide", message=FALSE} -Ltest <- predict(fit,counts_test) -``` - -The predictions appear to align well with the provided cell labels: - -```{r structure-plot-test, fig.width=7.5, fig.height=1.75, results="hide", message=FALSE} -fit_test <- list(F = fit$F,L = Ltest) -class(fit_test) <- c("multinom_topic_model_fit","list") -structure_plot(fit_test,topics = 1:6,colors = topic_colors,gap = 2, - grouping = pbmc_facs$samples_test$subpop) -``` - -Topics capture patterns of relative expression ----------------------------------------------- - -Each topic is represented as a vector of $m$ expression levels which -are stored as an $m \times K$ matrix: - -```{r factors-1} -dim(fit$F) -``` - -These expression levels are comparable across topics. For example, -genes *CD79A* are *CD79B* are important to B cells, so we would expect -higher expression in topic 4: - -```{r factors-2} -genes <- pbmc_facs$genes -rbind(fit$F[genes$symbol == "CD79B",], - fit$F[genes$symbol == "CD79A",]) -``` - -The most informative genes are those with higher expression in one -topic compared to the other topics. In the next section we explain how -to identify these genes. - -Annotating topics by differentially expressed genes ---------------------------------------------------- - -Consider a standard differential expression (DE) analysis of cell -types: for a given cell type, the log-fold change (LFC) is estimated -as the (base-2) log-ratio of two expression levels, the expression in -cells belonging to the cell type over the expression in cells not -belonging to the cell type. - -'fastTopics' extends the standard DE analysis to allow for *partial -membership* to multiple groups (the methods are described -[here][single-cell-topics-paper]). This "grade of membership" (GoM) -differential expression analysis is implemented by the -function `de_analysis`: - -```{r de-analysis-1, eval=FALSE} -set.seed(1) -de <- de_analysis(fit,counts,pseudocount = 0.1, - control = list(ns = 1e4,nc = 4)) -``` - -Since this computation can take 10 minutes or more to run, we have -provided the output of this `de_analysis` call in the `pbmc_facs` data -set: - -```{r de-analysis-2} -de <- pbmc_facs$de -``` - -The GoM DE analysis generates LFC estimates that have a similar -interpretation to a standard DE analysis. To illustrate, the GoM DE -analysis estimates strong overexpression of B-cell genes *CD79A* and -*CD79B* in topic 4: - -```{r diff-count-analysis-3} -rbind(de$postmean[genes$symbol == "CD79A",], - de$postmean[genes$symbol == "CD79B",]) -``` - -Ultimately, we would like to discover genes most relevant to a topic, -not just report LFC estimates for known genes. A common approach is to -visualize the results of the DE analysis using a volcano plot. For -example, here is the volcano plot for topic 4: - -```{r volcano-plot-bcells, fig.height=4, fig.width=4.5} -volcano_plot(de,k = 4,labels = genes$symbol) -``` - -Typically, the most interesting genes are found on the right-hand side -of the volcano plot---that is, genes with large LFC---so long as they -have a small lfsr (local false sign rate). Indeed, B cell gene *CD79A* -is on the far right-hand side of the plot. - -Likewise, natural killer genes such as *NKG7* and *GNLY* emerge on the -right-hand side of the volcano plot for topic 1: - -```{r volcano-plot-nk, fig.height=4, fig.width=4.5} -volcano_plot(de,k = 1,labels = genes$symbol,ymax = 100) -``` - -Some of the genes with the largest LFCs are *CD3D*, *CD3E* and *CD8B* -which suggest topic 6 is capturing expression specific to T cells: - -```{r volcano-plot-tcells, fig.height=4, fig.width=4.5} -volcano_plot(de,k = 6,labels = genes$symbol,ymax = 100) -``` - -See also [Part 2][vignette-part-2] for more DE analysis examples. - -Session info ------------- - -This is the version of R and the packages that were used to generate -these results. - -```{r session-info} -sessionInfo() -``` - -[vignette-part-2]: https://stephenslab.github.io/fastTopics/articles/single_cell_rnaseq_practical.html -[zheng-2017]: https://doi.org/10.1038/ncomms14049 -[single-cell-topics-paper]: https://doi.org/10.1101/2023.03.03.531029 diff --git a/vignettes/single_cell_rnaseq_practical.Rmd b/vignettes/single_cell_rnaseq_practical.Rmd deleted file mode 100644 index aea4bd1..0000000 --- a/vignettes/single_cell_rnaseq_practical.Rmd +++ /dev/null @@ -1,231 +0,0 @@ ---- -title: "Analysis of single-cell RNA-seq data using a topic model, Part 2: practical implementation" -author: Peter Carbonetto -date: "`r Sys.Date()`" -output: - html_document: - toc: yes - toc_float: no - highlight: textmate - theme: readable -vignette: > - %\VignetteIndexEntry{Analysis of single-cell RNA-seq data using a topic model, Part 2: practical implementation} - %\VignetteEngine{knitr::rmarkdown} - %\VignetteEncoding{UTF-8} ---- - -This vignette continues from [Part 1][vignette-part-1], where we -introduced the basic steps of a topic modeling analysis for -single-cell data. Here we give more detailed guidance and discuss -complications that may arise. - -```{r knitr-opts, include=FALSE} -knitr::opts_chunk$set(comment = "#",collapse = TRUE,results = "hold", - fig.align = "center",dpi = 120) -``` - -We begin our analysis by loading the packages. Then we set the seed so -that the results can be reproduced. - -```{r load-pkgs, message=FALSE, warning=FALSE} -library(Matrix) -library(fastTopics) -library(ggplot2) -library(cowplot) -set.seed(1) -``` - -Now we load the data set and the $K = 6$ pre-fitted topic model. - -```{r load-data} -data(pbmc_facs) -counts <- pbmc_facs$counts -fit <- pbmc_facs$fit -``` - -Assessing convergence of model fitting --------------------------------------- - -The topic model was fitted using the `fit_topic_model` function. This -function, as we mentioned, hides most of the complexities of model -fitting. Nevertheless, it is a good idea to check that the model -fitting has converged to a maximum-likelihood solution. This is -easily done with 'fastTopics': - -```{r plot-loglik, fig.height=2, fig.width=4} -plot_progress(fit,x = "iter",add.point.every = 10,colors = "black") + - theme_cowplot(font_size = 10) -``` - -Internally, `fit_topic_model` first fits a non-negative matrix -factorization (NMF) to the count data, then converts the fitted NMF to -a topic model, so the plot shows the improvement in the model fit over -time. Judging by this plot, the parameter estimates get close to a -maxium-likelihood solution after about 150 updates. - -For larger or more complex data sets, more updates may be needed to -obtain a good fit. For guidance, see `help(fit_topic_model)` and -`help(fit_poisson_nmf)`. - -Evaluating fit: single-cell likelihoods ---------------------------------------- - -The topic model can be used to calculate *a likelihood for each cell:* - -```{r loglik-1} -loglik <- loglik_multinom_topic_model(counts,fit) -``` - -This can be used to assess how well the topic model "fits" each cell. - -```{r loglik-2, fig.height=2, fig.width=4.5} -pdat <- data.frame(loglik) -ggplot(pdat,aes(loglik)) + - geom_histogram(bins = 64,color = "white",fill = "black",size = 0.25) + - labs(y = "number of cells") + - theme_cowplot(font_size = 10) -``` - -Most of the poorly fit cells are in the CD34+ subpopulation: - -```{r loglik-3, fig.height=2, fig.width=5.5} -subpop_colors <- c("dodgerblue","forestgreen","darkmagenta","skyblue","gold") -pdat <- data.frame(loglik = loglik,subpop = pbmc_facs$samples$subpop) -ggplot(pdat,aes(x = loglik,fill = subpop)) + - geom_histogram(bins = 64,color = "white",size = 0.25) + - scale_fill_manual(values = subpop_colors) + - labs(y = "number of cells") + - theme_cowplot(font_size = 10) -``` - -Visualizing the topics without cell labels ------------------------------------------- - -In [Part 1][single_cell_rnaseq_practical], we made use of additional -information about the cells to help us interpret the topics. Even -without the cell labels, the Structure plot can still be effective: - -```{r structure-plot-without-labels, fig.width=6.5, fig.height=1.5, results="hide", message=FALSE} -topic_colors <- c("skyblue","forestgreen","darkmagenta","dodgerblue", - "gold","darkorange") -structure_plot(fit,colors = topic_colors) -``` - -From this Structure plot, the topics clearly distinguish the B cells -(dark blue), CD14+ monocytes (green) and CD34+ cells (purple). The NK -cells (light blue) and T cells (cells high proportions of yellow and -orange) are harder to distinguish, and this is reflected in sharing of -topics (mostly topic 6) among NK and T cells. - -Of course, without the cell labels, we cannot know that the topics -correspond to these cell types without further downstream -analyses---for this, we performed a GoM DE analysis (see -[Part 1][vignette-part-1]). - -Identifying clusters from the topic proportions ------------------------------------------------ - -In more complex data sets, some structure may not show up well without -additional fine-tuning of the plot. - -One simple strategy that often works well is to first identify -clusters in the topic proportions matrix, $L$. Here we use $k$-means -to identify clusters, but other methods could be used. - -```{r clustering-1} -set.seed(1) -pca <- prcomp(fit$L)$x -clusters <- kmeans(pca,centers = 6,iter.max = 100)$cluster -summary(factor(clusters)) -``` - -We chose 6 clusters, but to be clear the best number of clusters may -not be the same as the number of topics. - -Now we create another Structure plot in which the cells are grouped -according to the clusters: - -```{r structure-plot-by-cluster-1, fig.width=7.5, fig.height=1.75, results="hide", message=FALSE} -structure_plot(fit,topics = 1:6,colors = topic_colors,gap = 25, - grouping = clusters) + - theme(axis.text.x = element_blank()) -``` - -k-means somewhat arbitrarily split the T cells (cells with high -proportions of topics 5 and 6) into clusters 1 and 3. Therefore, we -merge these two clusters: - -```{r clustering-2} -clusters[clusters == 3] <- 1 -clusters <- factor(clusters) -``` - -We have found that visual inspection of the clusters followed by -manual refinement is often be needed to get the "right" clustering. - -Now we can label the clusters by these cell types in the plot. -(Noting that this labeling is usually not possible until downstream -analyses have been performed.) - -```{r structure-plot-by-cluster-2, fig.width=7.5, fig.height=1.75, results="hide", message=FALSE} -levels(clusters) <- c("T","CD14+","B","CD34+","NK") -structure_plot(fit,topics = 1:6,colors = topic_colors,gap = 25, - grouping = clusters) -``` - -It is also sometimes helpful to visualize the structure in PCA plots -which show the projection of the cells onto PCs of the topic -proportions: - -```{r pca-plot-1, fig.width=3.5, fig.height=2.5, message=FALSE, warning=FALSE} -cluster_colors <- c("gold","forestgreen","dodgerblue","darkmagenta","skyblue") -pca_plot(fit,fill = clusters) + - scale_fill_manual(values = cluster_colors) -``` - -When there are many overlapping points, plotting the *density* can -sometimes show the clusters more clearly: - -```{r pca-plot-2, fig.width=3.75, fig.height=2.5, message=FALSE} -pca_hexbin_plot(fit,bins = 24) -``` - -More on differential expression analysis ----------------------------------------- - -In [Part 1][vignette-part-1], we performed a DE analysis using the -topic model. - -```{r diff-count-analysis} -de <- pbmc_facs$de -``` - -When the volcano plot shows many overlapping differentially expressed -genes, it can be helpful to explore the results interactively. The -function `volcano_plotly` creates an *interactive volcano plot* that can -be viewed in a Web browser. For example, here is an interactive -volcano plot for the T cells topic: - -```{r volcano-plotly, warning=FALSE} -genes <- pbmc_facs$genes -p <- volcano_plotly(de,k = 6,file = "volcano_plot_t_cells.html", - labels = genes$symbol,ymax = 100) -``` - -This call creates a new file `volcano_plot_t_cells.html`. The -interactive volcano plot can also be viewed [here][volcano-plotly], or -by calling `print(p)` in R. - -Session info ------------- - -This is the version of R and the packages that were used to generate -these results. - -```{r session-info} -sessionInfo() -``` - -[vignette-part-1]: https://stephenslab.github.io/fastTopics/articles/single_cell_rnaseq_basic.html -[volcano-plotly]: https://stephenslab.github.io/fastTopics/articles/volcano_plot_t_cells.html -[zheng-2017]: https://doi.org/10.1038/ncomms14049