Merge pull request #173 from cellgeni/master
All Simon's changes
Vladimir Kiselev authored Oct 9, 2020
2 parents cc32025 + 49027b3 commit 4d84a0e
Showing 43 changed files with 753 additions and 638 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -3,3 +3,4 @@ nextflow_output
.DS_*
*/.DS_*
.nextflow*
/course_files/website
2 changes: 1 addition & 1 deletion README.md
@@ -8,7 +8,7 @@ The number of computational tools is increasing rapidly and we are doing our bes

## Web page

__[https://scrnaseq-course.cog.sanger.ac.uk/website/index.html](https://scrnaseq-course.cog.sanger.ac.uk/website/index.html)__
__[https://scrnaseq-course.cog.sanger.ac.uk/browser.html?shared=data/](https://scrnaseq-course.cog.sanger.ac.uk/browser.html?shared=data/)__

## Video

68 changes: 68 additions & 0 deletions build-instructions.md
@@ -0,0 +1,68 @@
## Instructions for Building Course

### Clone Repository
To download the course, enter the following command in the directory you want the course to be downloaded into:
```
git clone https://github.com/cellgeni/scrnaseq-course-private.git
```

### Installing the Image
The course uses a Docker image run within a Singularity environment. To ensure you have all the correct
dependencies installed, please download the v4.07 [docker image](https://quay.io/repository/hemberg-group/scrna-seq-course?tab=tags).

A specific [version of Singularity](https://github.com/hpcng/singularity/tree/v3.5.3) (v3.5.3) is needed; there are [instructions](https://github.com/hpcng/singularity/blob/v3.5.3/INSTALL.md) for installing it.

Nextflow is also used to build the course; it has its own [installation instructions](https://www.nextflow.io/docs/latest/getstarted.html).
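
As a quick sketch (the exact image tag and output filename are assumptions based on the names used later in this document; check the quay.io tags page), the image can be pulled with Singularity itself:
```
# Pull the course image from quay.io and save it locally.
# Tag and output filename are assumptions, not taken from official instructions.
singularity pull quay.io-hemberg-group-scrna-seq-course-v4.07.img \
    docker://quay.io/hemberg-group/scrna-seq-course:v4.07
```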

### How to Build the Course
To build the course and generate new cache files, put the following script into a file (e.g. run-course):
```
vi run-course
```
then copy the following code into the file:
```
#!/bin/bash
set -euo pipefail

# Course source: either the GitHub project
# source=cellgeni/scrnaseq-course-private
# or the main.nf of your local clone:
source=/path-to-current-directory/scrnaseq-course-private/main.nf

# Make Singularity available on the PATH
export PATH=$PATH:/path-to-installed-singularity-software/singularity-v3.5.3/bin/

nextflow run $source -profile singularity -with-report reports/report.html -resume -ansi-log false
```

Make the script executable, then run it:
```
chmod +x /path-to-directory-containing-run-course-file/run-course
/path-to-directory-containing-run-course-file/run-course
```

Or if the file is in your current directory then you can use:
```
./run-course
```

This should build the course. The path to the work directory is printed at the end of the run, and the newly
built cache files will be located at:
```
/path-to-work-dir/course_work_dir/_bookdown_files/
```

### How to Upload Newly Built Cache to Amazon S3 Bucket
If you need to upload the new cache to the Amazon S3 bucket, you will need the AWS Access Key ID and
AWS Secret Access Key (not provided here).

Then start a Singularity shell using the following command:
```
SINGULARITYENV_AWS_ACCESS_KEY_ID=NOT-PROVIDED \
SINGULARITYENV_AWS_SECRET_ACCESS_KEY=STILL-NOT-PROVIDED \
/path-to-installed-singularity-software/singularity-v3.5.3/bin/singularity shell -B /any-paths-that-need-to-be-mounted /path-to-docker-images/quay.io-hemberg-group-scrna-seq-course-v4.07.img
```

Once the shell has started, use the following command to upload the new cache:
```
aws s3 sync /path-to-work-dir/course_work_dir/_bookdown_files/ s3://scrnaseq-course/_bookdown_files/
```
8 changes: 0 additions & 8 deletions conf/base.config
@@ -5,11 +5,3 @@ process {
cpus = 2
memory = 16.GB
}

// sanger S3 configuration
aws {
client {
endpoint = "https://cog.sanger.ac.uk"
signerOverride = "S3SignerType"
}
}
9 changes: 9 additions & 0 deletions conf/sanger-singularity.config
@@ -0,0 +1,9 @@
env {
http_proxy = 'http://wwwcache.sanger.ac.uk:3128'
https_proxy = 'http://wwwcache.sanger.ac.uk:3128'
}

singularity {
cacheDir = '/nfs/cellgeni/singularity/images_vlad/'
runOptions = '-B /lustre/ -B /nfs/users/nfs_c/cellgeni-su'
}
6 changes: 0 additions & 6 deletions conf/singularity.config
@@ -1,13 +1,7 @@

docker.enabled = false

env {
http_proxy = 'http://wwwcache.sanger.ac.uk:3128'
https_proxy = 'http://wwwcache.sanger.ac.uk:3128'
}

singularity {
enabled = true
autoMounts = true
cacheDir = '/nfs/cellgeni/singularity/images_vlad/'
}
67 changes: 37 additions & 30 deletions course_files/Intro-TabulaMuris.Rmd
@@ -2,6 +2,11 @@
output: html_document
---

```{r intro-tab0, echo=FALSE, cache=TRUE, cache.extra = list(R.version, sessionInfo())}
library(knitr)
opts_chunk$set(cache=TRUE, cache.extra = list(R.version, sessionInfo()))
```

# Tabula Muris

## Introduction
@@ -17,19 +22,21 @@ Unlike most single-cell RNASeq data Tabula Muris has release their data through

Terminal-based download of FACS data:

```{bash message=FALSE, warning=FALSE, results='hide'}
```{bash intro-tab1, message=FALSE, warning=FALSE, results='hide'}
echo $http_proxy
echo $https_proxy
wget https://ndownloader.figshare.com/files/10038307
unzip 10038307
unzip -o 10038307
wget https://ndownloader.figshare.com/files/10038310
mv 10038310 FACS_metadata.csv
wget https://ndownloader.figshare.com/files/10039267
mv 10039267 FACS_annotations.csv
```

Terminal-based download of 10X data:
```{bash, message=FALSE, warning=FALSE, results='hide'}
```{bash intro-tab2, message=FALSE, warning=FALSE, results='hide'}
wget https://ndownloader.figshare.com/files/10038325
unzip 10038325
unzip -o 10038325
wget https://ndownloader.figshare.com/files/10038328
mv 10038328 droplet_metadata.csv
wget https://ndownloader.figshare.com/files/10039264
@@ -39,11 +46,11 @@ mv 10039264 droplet_annotation.csv
Note: if you download the data by hand, you should unzip & rename the files as above before continuing.

You should now have two folders, "FACS" and "droplet", and one annotation file and one metadata file for each. To inspect these files you can use `head` to see the top few lines of the text files (press "q" to exit):
```{bash}
```{bash intro-tab3}
head -n 10 droplet_metadata.csv
```
You can also check the number of rows in each file using:
```{bash}
```{bash intro-tab4}
wc -l droplet_annotation.csv
```

@@ -58,27 +65,27 @@ Droplet : 42,193 cells

We can now read in the relevant count matrix from the comma-separated file and then inspect the resulting dataframe:

```{r}
```{r intro-tab5}
dat = read.delim("FACS/Kidney-counts.csv", sep=",", header=TRUE)
dat[1:5,1:5]
```
We can see that the first column in the dataframe contains the gene names, so first we move these to the rownames so that we have a numeric matrix:

```{r}
```{r intro-tab6}
dim(dat)
rownames(dat) <- dat[,1]
dat <- dat[,-1]
```

Since this is a Smartseq2 dataset it may contain spike-ins, so let's check:

```{r}
```{r intro-tab7}
rownames(dat)[grep("^ERCC-", rownames(dat))]
```

Now we can extract much of the metadata for this dataset from the column names:

```{r}
```{r intro-tab8}
cellIDs <- colnames(dat)
cell_info <- strsplit(cellIDs, "\\.")
Well <- lapply(cell_info, function(x){x[1]})
@@ -88,19 +95,19 @@ Mouse <- unlist(lapply(cell_info, function(x){x[3]}))
```
We can check the distributions of each of these metadata classifications:

```{r}
```{r intro-tab9}
summary(factor(Mouse))
```

We can also check if any technical factors are confounded:

```{r}
```{r intro-tab10}
table(Mouse, Plate)
```

Lastly we will read the computationally inferred cell-type annotations and match them to the cells in our expression matrix:

```{r}
```{r intro-tab11}
ann <- read.table("FACS_annotations.csv", sep=",", header=TRUE)
ann <- ann[match(cellIDs, ann[,1]),]
celltype <- ann[,3]
@@ -109,7 +116,7 @@ celltype <- ann[,3]
## Building a scater object
To create a SingleCellExperiment object we must put together all the cell annotations into a single dataframe. Since the experimental batch (PCR plate) is completely confounded with donor mouse, we will keep only one of them.

```{r, message=FALSE, warning=FALSE}
```{r intro-tab12, message=FALSE, warning=FALSE}
library("SingleCellExperiment")
library("scater")
cell_anns <- data.frame(mouse = Mouse, well=Well, type=celltype)
@@ -118,7 +125,7 @@ sceset <- SingleCellExperiment(assays = list(counts = as.matrix(dat)), colData=c
```

Finally, if the dataset contains spike-ins, we set a hidden variable in the SingleCellExperiment object to track them:
```{r}
```{r intro-tab13}
isSpike(sceset, "ERCC") <- grepl("ERCC-", rownames(sceset))
```
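
Note that in more recent versions of the SingleCellExperiment package `isSpike` has been removed in favour of "alternative experiments". A hedged sketch of the newer idiom (assuming a recent Bioconductor release; not used elsewhere in this course):
```{r, eval=FALSE}
# Sketch for newer SingleCellExperiment releases: store the ERCC spike-ins
# as an alternative experiment instead of flagging them with isSpike().
is_ercc <- grepl("^ERCC-", rownames(sceset))
altExp(sceset, "ERCC") <- sceset[is_ercc, ]
sceset <- sceset[!is_ercc, ]
```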
@@ -134,57 +141,57 @@ respectively.

We will be using the "Matrix" package to store matrices in sparse-matrix format in R.

```{r}
```{r intro-tab14}
library("Matrix")
cellbarcodes <- read.table("droplet/Kidney-10X_P4_5/barcodes.tsv")
genenames <- read.table("droplet/Kidney-10X_P4_5/genes.tsv")
molecules <- readMM("droplet/Kidney-10X_P4_5/matrix.mtx")
```
Now we will add the appropriate row and column names. However, if you inspect the cell barcodes we just read in, you will see that they are just the barcode sequence associated with each cell. This is a problem because each batch of 10X data uses the same pool of barcodes, so if we need to combine data from multiple 10X batches the cell barcodes will not be unique. Hence we will attach the batch ID to each cell barcode:
```{r}
```{r intro-tab15}
head(cellbarcodes)
```

```{r}
```{r intro-tab16}
rownames(molecules) <- genenames[,1]
colnames(molecules) <- paste("10X_P4_5", cellbarcodes[,1], sep="_")
```
Now let's get the metadata and computational annotations for this data:

```{r}
```{r intro-tab17}
meta <- read.delim("droplet_metadata.csv", sep=",", header = TRUE)
head(meta)
```
Here we can see that we need to use "10X_P4_5" to find the metadata for this batch. Also note that the format of the mouse ID is different in this metadata table, with hyphens instead of underscores and with the gender in the middle of the ID. From checking the methods section of the accompanying paper we know that the same 8 mice were used for both droplet and plate-based techniques, so we need to fix the mouse IDs to be consistent with those used in the FACS experiments.

```{r}
```{r intro-tab18}
meta[meta$channel == "10X_P4_5",]
mouseID <- "3_8_M"
```
Note: depending on the tissue you have been assigned, you may have 10X data from mixed samples, e.g. mouse id = 3-M-5/6. You should still reformat these to be consistent (see the sketch below), but they will not match mouse ids from the FACS data, which may affect your downstream analysis. If the mice weren't from an inbred strain it would be possible to assign individual cells to a specific mouse using exonic-SNPs, but that is beyond the scope of this course.
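
A minimal sketch of such reformatting (the input format and the helper name `reformat_mouse_id` are assumptions based on the examples above):
```{r, eval=FALSE}
# Sketch: convert a metadata-style mouse ID ("3-M-8", "3-M-5/6") into the
# underscore format with the sex at the end ("3_8_M"), as used by the FACS data.
reformat_mouse_id <- function(id) {
    parts <- strsplit(id, "-")[[1]]                  # e.g. c("3", "M", "8")
    paste(parts[1], parts[3], parts[2], sep = "_")   # -> "3_8_M"
}
reformat_mouse_id("3-M-8")    # "3_8_M"
reformat_mouse_id("3-M-5/6")  # "3_5/6_M"
```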

```{r}
```{r intro-tab19}
ann <- read.delim("droplet_annotation.csv", sep=",", header=TRUE)
head(ann)
```
Again you will find a slight formatting difference between the cellID in the annotation and the cellbarcodes, which we will have to correct before matching them.

```{r}
```{r intro-tab20}
ann[,1] <- paste(ann[,1], "-1", sep="")
ann_subset <- ann[match(colnames(molecules), ann[,1]),]
celltype <- ann_subset[,3]
```

Now let's build the cell-metadata dataframe:
```{r}
```{r intro-tab21}
cell_anns <- data.frame(mouse = rep(mouseID, times=ncol(molecules)), type=celltype)
rownames(cell_anns) <- colnames(molecules)
```

__Exercise__ Repeat the above for the other 10X batches for your tissue.

__Answer__
```{r, echo=FALSE, eval=TRUE}
```{r intro-tab22, echo=FALSE, eval=TRUE}
molecules1 <- molecules
cell_anns1 <- cell_anns
@@ -221,21 +228,21 @@ cell_anns3 <- cell_anns

Now that we have read in the 10X data from multiple batches, we need to combine them into a single SingleCellExperiment object. First we will check that the gene names are the same and in the same order across all batches:

```{r}
```{r intro-tab23}
identical(rownames(molecules1), rownames(molecules2))
identical(rownames(molecules1), rownames(molecules3))
```

Now we'll check that there aren't any repeated cellIDs:
```{r}
```{r intro-tab24}
sum(colnames(molecules1) %in% colnames(molecules2))
sum(colnames(molecules1) %in% colnames(molecules3))
sum(colnames(molecules2) %in% colnames(molecules3))
```

Everything is ok, so we can go ahead and combine them:

```{r}
```{r intro-tab25}
all_molecules <- cbind(molecules1, molecules2, molecules3)
all_cell_anns <- as.data.frame(rbind(cell_anns1, cell_anns2, cell_anns3))
all_cell_anns$batch <- rep(c("10X_P4_5", "10X_P4_6","10X_P7_5"), times = c(nrow(cell_anns1), nrow(cell_anns2), nrow(cell_anns3)))
@@ -245,13 +252,13 @@ __Exercise__
How many cells are in the whole dataset?

__Answer__
```{r, echo=FALSE, eval=FALSE}
```{r intro-tab26, echo=FALSE, eval=FALSE}
dim(all_molecules)[2]
```

Now build the SingleCellExperiment object. One of the advantages of the SingleCellExperiment class is that it is capable of storing data in normal matrix or sparse-matrix format, as well as in HDF5 format, which allows large non-sparse matrices to be stored & accessed on disk in an efficient manner rather than loading the whole thing into RAM.

```{r}
```{r intro-tab27}
all_molecules <- as.matrix(all_molecules)
sceset <- SingleCellExperiment(
assays = list(counts = as.matrix(all_molecules)),
@@ -260,7 +267,7 @@ sceset <- SingleCellExperiment(
```
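
As an illustration of the HDF5 option, here is a hedged sketch (not part of the course build; it assumes the Bioconductor `HDF5Array` package is installed):
```{r, eval=FALSE}
# Sketch: keep the counts on disk in HDF5 format instead of in RAM.
library(HDF5Array)
h5_counts <- writeHDF5Array(all_molecules)  # written to an automatic .h5 dump file
sceset_h5 <- SingleCellExperiment(
    assays = list(counts = h5_counts),
    colData = all_cell_anns
)
```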

Since this is 10X data it will not contain spike-ins, so we just save the data:
```{r}
```{r intro-tab28}
saveRDS(sceset, "kidney_droplet.rds")
```
