Fixed up some of the documentation, added the saved parameters thingy.

BaderLab · Aug 17, 2018 · 5987d4a
1 parent 753f5ed
commit 5987d4a
Showing 8 changed files with 583 additions and 340 deletions.
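
The "saved parameters" in the commit message refer to the new `params` element that the R/deTest.R change adds to the list returned by `clusterWiseDEtest`. The pattern can be sketched roughly as follows. `toy_de_test` is a hypothetical stand-in, not part of scClustViz, and its defaults are simply copied from the roxygen text in this diff:

```r
# Hypothetical stand-in for clusterWiseDEtest(): it only records its arguments
# alongside a placeholder for the real test results.
toy_de_test <- function(exponent = 2, pseudocount = 1, FDRthresh = 0.01,
                        threshType = "dDR", dDRthresh = 0.15,
                        logGERthresh = 1) {
  list(
    deTissue = list(),  # placeholder; the real function fills this in
    params = list(exponent = exponent,
                  pseudocount = pseudocount,
                  FDRthresh = FDRthresh,
                  threshType = threshType,
                  dDRthresh = dDRthresh,
                  logGERthresh = logGERthresh)
  )
}

# Downstream code can read the thresholds back from the saved object
# instead of asking the user to supply them a second time.
res <- toy_de_test(exponent = exp(1), FDRthresh = 0.05)
str(res$params)
```

Keeping the parameters inside the returned object means they travel with the results when both are written to the RData file, which is presumably why the documentation says they are "saved so that the same parameters are used in the Shiny app".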
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -2,7 +2,7 @@ Package: scClustViz
 Type: Package
 Title: Differential Expression-based scRNAseq Cluster Assessment and Viewing
 Version: 0.2.0
-Date: 2018-08-16
+Date: 2018-08-17
 Authors@R: c(as.person("Brendan T. Innes <[email protected]> [aut,cre]"),
              as.person("Gary D. Bader [aut,ths]"))
 Description: An interactive R Shiny tool for visualizing single-cell RNAseq clustering 

diff --git a/R/deTest.R b/R/deTest.R
@@ -1,80 +1,157 @@
 #' Cluster-wise differential expression testing
 #'
-#' Performs differential expression testing between clusters for all cluster solutions in 
-#' order to assess the biological relevance of each cluster solution. Differential 
-#' expression testing is done using the Wilcoxon rank-sum test implemented in the base R 
-#' \code{stats} package. For details about what is being compared in the tests, see the 
-#' "Value" section.
+#' Performs differential expression testing between clusters for all cluster
+#' solutions in order to assess the biological relevance of each cluster
+#' solution. Differential expression testing is done using the Wilcoxon rank-sum
+#' test implemented in the base R \code{stats} package. For details about what
+#' is being compared in the tests, see the "Value" section.
 #'
-#' @param il The list outputted by one of the importData functions (either 
+#' @param il The list outputted by one of the importData functions (either
 #'   \code{\link{readFromSeurat}} or \code{\link{readFromManual}}).
-#'   
-#' @param testAll Logical value indicating whether to test all cluster solutions 
-#'   (\code{TRUE}) or stop testing once a cluster solution has been found where there is 
-#'   no differentially expressed genes found between at least one pair of nearest 
-#'   neighbouring clusters (\code{FALSE}). \emph{If set to (\code{FALSE}), only the 
-#'   cluster solutions tested will appear in the scClustViz shiny app.}
-#' 
-#' @param exponent The log base of your normalized input data. Seurat normalization uses 
-#'   the natural log (set this to exp(1)), while other normalization methods generally use 
-#'   log2 (set this to 2).
-#' 
-#' @param pseudocount The pseudocount added to all log-normalized values in your input 
-#'   data. Most methods use a pseudocount of 1 to eliminate log(0) errors.
-#' 
-#' @param FDRthresh The false discovery rate to use as a threshold for determining statistical 
-#'   significance of differential expression calculated by the Wilcoxon rank-sum test.
-#' 
-#' @param threshType Filtering genes for use in differential expression testing can be 
-#'   done multiple ways. We use an expression ratio filter for comparing each cluster to 
-#'   the rest of the tissue as a whole, but find that difference in detection rates works 
-#'   better when comparing clusters to each other. You can set threshType to 
-#'   \code{"logGER"} to use a gene expression ratio for all gene filtering, or leave it as 
-#'   default (\code{"dDR"}) to use difference in detection rate as the thresholding method 
-#'   when comparing clusters to each other.
-#' 
-#' @param dDRthresh Magnitude of detection rate difference of a gene between clusters to 
-#'   use as filter for determining which genes to test for differential expression between 
-#'   clusters.
-#' 
-#' @param logGERthresh Magnitude of gene expression ratio for a gene between clusters to 
-#'   use as filter for determining which genes to test for differential expression between 
-#'   clusters.
-#' 
-#' @return The function returns a list containing the results of differential expression 
-#'   testing for all sets of cluster solutions. \emph{Saving both the input (the object passed 
-#'   to the \code{il} argument) and the output of this function to an RData file is all 
-#'   the preparation necessary for running the scClustViz Shiny app itself.}
-#'   The output list of this function contains the following elements:
-#'   \describe{
-#'     \item{CGS}{} 
-#'     \item{deTissue}{} 
-#'     \item{deVS}{}
-#'     \item{deMarker}{}
-#'     \item{deDist}{} 
-#'     \item{deNeighb}{} 
+#'
+#' @param testAll Default = TRUE. Logical value indicating whether to test all
+#'   cluster solutions (\code{TRUE}) or stop testing once a cluster solution is
+#'   reached in which no differentially expressed genes are found between at
+#'   least one pair of nearest neighbouring clusters (\code{FALSE}). \emph{If
+#'   set to \code{FALSE}, only the cluster solutions tested will appear in the
+#'   scClustViz shiny app.}
+#'
+#' @param exponent Default = 2. The log base of your normalized input data.
+#'   Seurat normalization uses the natural log (set this to exp(1)), while other
+#'   normalization methods generally use log2 (set this to 2).
+#'
+#' @param pseudocount Default = 1. The pseudocount added to all log-normalized
+#'   values in your input data. Most methods use a pseudocount of 1 to eliminate
+#'   log(0) errors.
+#'
+#' @param FDRthresh Default = 0.01. The false discovery rate to use as a
+#'   threshold for determining statistical significance of differential
+#'   expression calculated by the Wilcoxon rank-sum test.
+#'
+#' @param threshType Default = "dDR". Filtering genes for use in differential
+#'   expression testing can be done multiple ways. We use an expression ratio
+#'   filter for comparing each cluster to the rest of the tissue as a whole, but
+#'   find that difference in detection rates works better when comparing
+#'   clusters to each other. You can set threshType to \code{"logGER"} to use a
+#'   gene expression ratio for all gene filtering, or leave it as default
+#'   (\code{"dDR"}) to use difference in detection rate as the thresholding
+#'   method when comparing clusters to each other.
+#'
+#' @param dDRthresh Default = 0.15. Magnitude of detection rate difference of a
+#'   gene between clusters to use as filter for determining which genes to test
+#'   for differential expression between clusters.
+#'
+#' @param logGERthresh Default = 1. Magnitude of gene expression ratio for a
+#'   gene between clusters to use as filter for determining which genes to test
+#'   for differential expression between clusters.
+#'
+#' @return The function returns a list containing the results of differential
+#'   expression testing for all sets of cluster solutions. \emph{Saving both the
+#'   input (the object passed to the \code{il} argument) and the output of this
+#'   function to an RData file is all the preparation necessary for running the
+#'   scClustViz Shiny app itself.} The output list of this function contains the
+#'   following elements: 
+#'   \describe{ 
+#'     \item{CGS}{A nested list of dataframes. Each list element is named for 
+#'       a column in \code{il$cl} (a cluster resolution). That list element 
+#'       contains a named list of clusters at that resolution. Each of those 
+#'       list elements contains a dataframe of three variables, where each 
+#'       sample is a gene. \code{DR} is the proportion of cells in the cluster 
+#'       in which that gene was detected. \code{MDTC} is mean normalized gene 
+#'       expression for that gene in only the cells in which it was detected 
+#'       (see \link{meanLogX} for mean calculation). \code{MTC} is the mean 
+#'       normalized gene expression for that gene in all cells of the cluster 
+#'       (see \link{meanLogX} for mean calculation).} 
+#'     \item{deTissue}{Differential testing results from Wilcoxon rank sum tests 
+#'       comparing a gene in each cluster to the rest of the cells as a whole in 
+#'       a one vs all comparison. The results are stored as a nested list of 
+#'       dataframes. Each list element is named for a column in \code{il$cl} (a 
+#'       cluster resolution). That list element contains a named list of 
+#'       clusters at that resolution. Each of those list elements contains a 
+#'       dataframe of three variables, where each sample is a gene. 
+#'       \code{logGER} is the log gene expression ratio calculated by 
+#'       subtracting the mean expression of the gene (see \link{meanLogX} for 
+#'       mean calculation) in all other cells from the mean expression of the 
+#'       gene in this cluster. \code{pVal} is the p-value of the Wilcoxon rank 
+#'       sum test. \code{qVal} is the false discovery rate-corrected p-value of 
+#'       the test.} 
+#'     \item{deVS}{Differential testing results from Wilcoxon rank sum tests 
+#'       comparing a gene in each cluster to that gene in every other cluster in 
+#'       a series of tests. The results are stored as a nested list of 
+#'       dataframes. Each list element is named for a column in \code{il$cl} (a 
+#'       cluster resolution). That list element contains a named list of 
+#'       clusters at that resolution (cluster A). Each of those lists contains a 
+#'       named list of all the other clusters at that resolution (cluster B). 
+#'       Each of those list elements contains a dataframe of four variables, 
+#'       where each sample is a gene. \code{dDR} is the difference in detection 
+#'       rate of that gene between the two clusters (DR[A] - DR[B]). 
+#'       \code{logGER} is the log gene expression ratio calculated by taking the 
+#'       difference in mean expression of the gene (see \link{meanLogX} for 
+#'       mean calculation) between the two clusters (MTC[A] - MTC[B]). 
+#'       \code{pVal} is the p-value of the Wilcoxon rank sum test. \code{qVal} 
+#'       is the false discovery rate-corrected p-value of the test.}
+#'     \item{deMarker}{Differential testing results from Wilcoxon rank sum tests 
+#'       comparing a gene in each cluster to that gene in every other cluster in 
+#'       a series of tests, and filtering for only those genes that show 
+#'       significant positive differential expression versus all other clusters. 
+#'       The results are stored as a nested list of dataframes. Each list 
+#'       element is named for a column in \code{il$cl} (a cluster resolution). 
+#'       That list element contains a named list of clusters at that resolution 
+#'       (cluster A). Each of those list elements contains a dataframe where 
+#'       variables represent comparisons to all the other clusters and each 
+#'       sample is a gene. For each other cluster (cluster B), there are three 
+#'       variables, named as follows: \code{vs.B.dDR} is the difference in 
+#'       detection rate of that gene between the two clusters (DR[A] - DR[B]). 
+#'       \code{vs.B.logGER} is the log gene expression ratio calculated by 
+#'       taking the difference in mean expression of the gene (see 
+#'       \link{meanLogX} for mean calculation) between the two clusters (MTC[A] 
+#'       - MTC[B]). \code{vs.B.qVal} is the false discovery rate-corrected 
+#'       p-value of the Wilcoxon rank sum test.} 
+#'     \item{deDist}{A named list of distances between clusters for each cluster 
+#'       resolution. Distances are calculated as number of differentially 
+#'       expressed genes between clusters.} 
+#'     \item{deNeighb}{Differential testing results from Wilcoxon rank sum tests 
+#'       comparing a gene in each cluster to that gene in its nearest 
+#'       neighbouring cluster (calculated by number of differentially expressed 
+#'       genes), and filtering for only those genes that show significant
+#'       positive differential expression versus that neighbour. The results
+#'       are stored as a nested list of dataframes. Each list element is named 
+#'       for a column in \code{il$cl} (a cluster resolution). That list element 
+#'       contains a named list of clusters at that resolution (cluster A). Each 
+#'       of those list elements contains a dataframe where variables represent 
+#'       the comparison to its nearest neighbouring cluster (cluster B) and each 
+#'       sample is a gene. There are three variables, named as follows: 
+#'       \code{vs.B.dDR} is the difference in detection rate of that gene 
+#'       between the two clusters (DR[A] - DR[B]). \code{vs.B.logGER} is the log 
+#'       gene expression ratio calculated by taking the difference in mean 
+#'       expression of the gene (see \link{meanLogX} for mean calculation) 
+#'       between the two clusters (MTC[A] - MTC[B]). \code{vs.B.qVal} is the 
+#'       false discovery rate-corrected p-value of the Wilcoxon rank sum test.}
+#'     \item{params}{A list of the parameters from the argument list of this 
+#'       function used to do the analysis, saved so that the same parameters are 
+#'       used in the Shiny app.} 
 #'   }
-#' 
-#' @examples 
+#'
+#' @examples
 #' \dontrun{
-#'  data_for_scClustViz <- readFromSeurat(your_seurat_object,
-#'                                        convertGeneIDs=F)
-#'  rm(your_seurat_object) 
+#'  data_for_scClustViz <- readFromSeurat(your_seurat_object)
+#'  rm(your_seurat_object)
 #'  # All the data scClustViz needs is in 'data_for_scClustViz'.
-#'  
+#'
 #'  DE_for_scClustViz <- clusterWiseDEtest(data_for_scClustViz)
-#'  
+#'
 #'  save(data_for_scClustViz,DE_for_scClustViz,
 #'       file="for_scClustViz.RData")
 #'  # Save these objects so you'll never have to run this slow function again!
-#'  
+#'
 #'  runShiny(filePath="for_scClustViz.RData")
 #' }
-#' 
-#' @seealso \code{\link{readFromSeurat}} or \code{\link{readFromManual}} for reading in 
-#'   data to generate the input object for this function, and \code{\link{runShiny}} to 
-#'   use the interactive Shiny GUI to view the results of this testing.
-#' 
+#'
+#' @seealso \code{\link{readFromSeurat}} or \code{\link{readFromManual}} for
+#'   reading in data to generate the input object for this function, and
+#'   \code{\link{runShiny}} to use the interactive Shiny GUI to view the results
+#'   of this testing.
+#'
 #' @export
 
 
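
The per-gene statistics documented above (`dDR`, `logGER`, `pVal`, `qVal`) can be illustrated for a single gene and a single pair of clusters using only base R. This is a sketch under stated assumptions, not the package's implementation: the data are simulated, `mean_log` is a hypothetical helper standing in for `meanLogX` (whose exact formula is not shown in this diff), and the FDR correction is assumed to be Benjamini-Hochberg via `p.adjust`.

```r
set.seed(1)

# Simulated log2-normalized expression of one gene in two clusters (toy data).
exponent <- 2
pseudocount <- 1
clustA <- log(rpois(50, lambda = 4) + pseudocount, base = exponent)
clustB <- log(rpois(60, lambda = 1) + pseudocount, base = exponent)

# Hypothetical helper: average in linear space, then return to log space.
# The real meanLogX in scClustViz may differ in detail.
mean_log <- function(x, ex = 2, pc = 1) {
  log(mean(ex^x - pc) + pc, base = ex)
}

# Detection rate: proportion of cells in which the gene was detected.
DR_A <- mean(clustA > 0)
DR_B <- mean(clustB > 0)
dDR <- DR_A - DR_B                          # DR[A] - DR[B]

# Log gene expression ratio: difference of the log-scale means.
logGER <- mean_log(clustA, exponent, pseudocount) -
  mean_log(clustB, exponent, pseudocount)   # MTC[A] - MTC[B]

# Wilcoxon rank-sum test from the stats package (ties only raise a warning).
pVal <- suppressWarnings(wilcox.test(clustA, clustB)$p.value)

# FDR correction is applied across genes; the other two p-values are made up.
pVals <- c(geneX = pVal, geneY = 0.2, geneZ = 0.04)
qVals <- p.adjust(pVals, method = "fdr")

data.frame(dDR = dDR, logGER = logGER, pVal = pVal, qVal = qVals[["geneX"]])
```

In the documented workflow, `dDRthresh` or `logGERthresh` would decide which genes are tested at all, and `FDRthresh` would be applied to `qVal` afterwards.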
@@ -87,7 +164,13 @@ clusterWiseDEtest <- function(il,testAll=TRUE,
   options(warn=-1)
 
   out <- list(CGS=list(),deTissue=list(),deVS=list(),
-              deMarker=list(),deDist=list(),deNeighb=list())
+              deMarker=list(),deDist=list(),deNeighb=list(),
+              params=list(exponent=exponent,
+                          pseudocount=pseudocount,
+                          FDRthresh=FDRthresh,
+                          threshType=threshType,
+                          dDRthresh=dDRthresh,
+                          logGERthresh=logGERthresh))
   # This loop iterates through every cluster solution, and does DE testing between clusters
   # to generate the DE metrics for assessing your clusters.  This takes some time.
   for (res in colnames(il[["cl"]])) {