Merge pull request #173 from cellgeni/master
All Simon's changes
Vladimir Kiselev authored Oct 9, 2020
2 parents cc32025 + 49027b3 commit 4d84a0e
Showing 43 changed files with 753 additions and 638 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -3,3 +3,4 @@ nextflow_output
.DS_*
*/.DS_*
.nextflow*
/course_files/website
2 changes: 1 addition & 1 deletion README.md
@@ -8,7 +8,7 @@ The number of computational tools is increasing rapidly and we are doing our bes

## Web page

__[https://scrnaseq-course.cog.sanger.ac.uk/website/index.html](https://scrnaseq-course.cog.sanger.ac.uk/website/index.html)__
__[https://scrnaseq-course.cog.sanger.ac.uk/browser.html?shared=data/](https://scrnaseq-course.cog.sanger.ac.uk/browser.html?shared=data/)__

## Video

68 changes: 68 additions & 0 deletions build-instructions.md
@@ -0,0 +1,68 @@
## Instructions for Building Course

### Clone Repository
To download the course, enter the following command in the directory you want the course to be downloaded into:
```
git clone https://github.com/cellgeni/scrnaseq-course-private.git
```

### Installing the Image
The course uses a Docker image run within a Singularity environment. To ensure you have all the correct
dependencies installed, please download the v4.07 [docker image](https://quay.io/repository/hemberg-group/scrna-seq-course?tab=tags).

A specific [version of Singularity](https://github.com/hpcng/singularity/tree/v3.5.3) (v3.5.3) is needed; there are [instructions](https://github.com/hpcng/singularity/blob/v3.5.3/INSTALL.md) for installing it.

Nextflow is also used to build the course; it has its own [installation instructions](https://www.nextflow.io/docs/latest/getstarted.html).
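
As a quick sketch (the exact image tag and output filename are assumptions based on the names used later in this document; check the quay.io tags page), the image can be pulled with Singularity itself:
```
# Pull the course image from quay.io and save it locally.
# Tag and output filename are assumptions, not taken from official instructions.
singularity pull quay.io-hemberg-group-scrna-seq-course-v4.07.img \
    docker://quay.io/hemberg-group/scrna-seq-course:v4.07
```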

### How to Build the Course
To build the course and generate new cache files, put the following script into a file (e.g. run-course):
```
vi run-course
```
then copy the following code into the file:
```
#!/bin/bash
set -euo pipefail

# Course source: either the GitHub project
# source=cellgeni/scrnaseq-course-private
# or the main.nf of your local clone:
source=/path-to-current-directory/scrnaseq-course-private/main.nf

# Make Singularity available on the PATH
export PATH=$PATH:/path-to-installed-singularity-software/singularity-v3.5.3/bin/

nextflow run $source -profile singularity -with-report reports/report.html -resume -ansi-log false
```

Make the script executable, then run it:
```
chmod +x /path-to-directory-containing-run-course-file/run-course
/path-to-directory-containing-run-course-file/run-course
```

Or if the file is in your current directory then you can use:
```
./run-course
```

This should build the course. The path to the work directory is printed at the end of the run, and the newly
built cache files will be located at:
```
/path-to-work-dir/course_work_dir/_bookdown_files/
```

### How to Upload Newly Built Cache to Amazon S3 Bucket
If you need to upload the new cache to the Amazon S3 bucket, you will need the AWS Access Key ID and
AWS Secret Access Key (not provided here).

Then start a Singularity shell using the following command:
```
SINGULARITYENV_AWS_ACCESS_KEY_ID=NOT-PROVIDED \
SINGULARITYENV_AWS_SECRET_ACCESS_KEY=STILL-NOT-PROVIDED \
/path-to-installed-singularity-software/singularity-v3.5.3/bin/singularity shell -B /any-paths-that-need-to-be-mounted /path-to-docker-images/quay.io-hemberg-group-scrna-seq-course-v4.07.img
```

Once the shell has started, use the following command to upload the new cache:
```
aws s3 sync /path-to-work-dir/course_work_dir/_bookdown_files/ s3://scrnaseq-course/_bookdown_files/
```
8 changes: 0 additions & 8 deletions conf/base.config
@@ -5,11 +5,3 @@ process {
cpus = 2
memory = 16.GB
}

// sanger S3 configuration
aws {
client {
endpoint = "https://cog.sanger.ac.uk"
signerOverride = "S3SignerType"
}
}
9 changes: 9 additions & 0 deletions conf/sanger-singularity.config
@@ -0,0 +1,9 @@
env {
http_proxy = 'http://wwwcache.sanger.ac.uk:3128'
https_proxy = 'http://wwwcache.sanger.ac.uk:3128'
}

singularity {
cacheDir = '/nfs/cellgeni/singularity/images_vlad/'
runOptions = '-B /lustre/ -B /nfs/users/nfs_c/cellgeni-su'
}
6 changes: 0 additions & 6 deletions conf/singularity.config
@@ -1,13 +1,7 @@

docker.enabled = false

env {
http_proxy = 'http://wwwcache.sanger.ac.uk:3128'
https_proxy = 'http://wwwcache.sanger.ac.uk:3128'
}

singularity {
enabled = true
autoMounts = true
cacheDir = '/nfs/cellgeni/singularity/images_vlad/'
}
67 changes: 37 additions & 30 deletions course_files/Intro-TabulaMuris.Rmd
@@ -2,6 +2,11 @@
output: html_document
---

```{r intro-tab0, echo=FALSE, cache=TRUE, cache.extra = list(R.version, sessionInfo())}
library(knitr)
opts_chunk$set(cache=TRUE, cache.extra = list(R.version, sessionInfo()))
```

# Tabula Muris

## Introduction
@@ -17,19 +22,21 @@ Unlike most single-cell RNASeq data Tabula Muris has release their data through

Terminal-based download of FACS data:

```{bash message=FALSE, warning=FALSE, results='hide'}
```{bash intro-tab1, message=FALSE, warning=FALSE, results='hide'}
echo $http_proxy
echo $https_proxy
wget https://ndownloader.figshare.com/files/10038307
unzip 10038307
unzip -o 10038307
wget https://ndownloader.figshare.com/files/10038310
mv 10038310 FACS_metadata.csv
wget https://ndownloader.figshare.com/files/10039267
mv 10039267 FACS_annotations.csv
```

Terminal-based download of 10X data:
```{bash, message=FALSE, warning=FALSE, results='hide'}
```{bash intro-tab2, message=FALSE, warning=FALSE, results='hide'}
wget https://ndownloader.figshare.com/files/10038325
unzip 10038325
unzip -o 10038325
wget https://ndownloader.figshare.com/files/10038328
mv 10038328 droplet_metadata.csv
wget https://ndownloader.figshare.com/files/10039264
@@ -39,11 +46,11 @@ mv 10039264 droplet_annotation.csv
Note: if you download the data by hand, you should unzip & rename the files as above before continuing.

You should now have two folders, "FACS" and "droplet", and one annotation file and one metadata file for each. To inspect these files you can use `head` to see the top few lines of the text files (press "q" to exit):
```{bash}
```{bash intro-tab3}
head -n 10 droplet_metadata.csv
```
You can also check the number of rows in each file using:
```{bash}
```{bash intro-tab4}
wc -l droplet_annotation.csv
```

@@ -58,27 +65,27 @@ Droplet : 42,193 cells

We can now read in the relevant count matrix from the comma-separated file and then inspect the resulting dataframe:

```{r}
```{r intro-tab5}
dat = read.delim("FACS/Kidney-counts.csv", sep=",", header=TRUE)
dat[1:5,1:5]
```
We can see that the first column in the dataframe contains the gene names, so first we move these to the rownames so that we have a numeric matrix:

```{r}
```{r intro-tab6}
dim(dat)
rownames(dat) <- dat[,1]
dat <- dat[,-1]
```

Since this is a Smartseq2 dataset it may contain spike-ins, so let's check:

```{r}
```{r intro-tab7}
rownames(dat)[grep("^ERCC-", rownames(dat))]
```

Now we can extract much of the metadata for this dataset from the column names:

```{r}
```{r intro-tab8}
cellIDs <- colnames(dat)
cell_info <- strsplit(cellIDs, "\\.")
Well <- lapply(cell_info, function(x){x[1]})
@@ -88,19 +95,19 @@ Mouse <- unlist(lapply(cell_info, function(x){x[3]}))
```
We can check the distributions of each of these metadata classifications:

```{r}
```{r intro-tab9}
summary(factor(Mouse))
```

We can also check if any technical factors are confounded:

```{r}
```{r intro-tab10}
table(Mouse, Plate)
```

Lastly we will read the computationally inferred cell-type annotations and match them to the cells in our expression matrix:

```{r}
```{r intro-tab11}
ann <- read.table("FACS_annotations.csv", sep=",", header=TRUE)
ann <- ann[match(cellIDs, ann[,1]),]
celltype <- ann[,3]
@@ -109,7 +116,7 @@ celltype <- ann[,3]
## Building a scater object
To create a SingleCellExperiment object we must put together all the cell annotations into a single dataframe. Since the experimental batch (PCR plate) is completely confounded with donor mouse, we will keep only one of them.

```{r, message=FALSE, warning=FALSE}
```{r intro-tab12, message=FALSE, warning=FALSE}
library("SingleCellExperiment")
library("scater")
cell_anns <- data.frame(mouse = Mouse, well=Well, type=celltype)
@@ -118,7 +125,7 @@ sceset <- SingleCellExperiment(assays = list(counts = as.matrix(dat)), colData=c
```

Finally, if the dataset contains spike-ins, we set a hidden variable in the SingleCellExperiment object to track them:
```{r}
```{r intro-tab13}
isSpike(sceset, "ERCC") <- grepl("ERCC-", rownames(sceset))
```
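
Note that in more recent versions of the SingleCellExperiment package `isSpike` has been removed in favour of "alternative experiments". A hedged sketch of the newer idiom (assuming a recent Bioconductor release; not used elsewhere in this course):
```{r, eval=FALSE}
# Sketch for newer SingleCellExperiment releases: store the ERCC spike-ins
# as an alternative experiment instead of flagging them with isSpike().
is_ercc <- grepl("^ERCC-", rownames(sceset))
altExp(sceset, "ERCC") <- sceset[is_ercc, ]
sceset <- sceset[!is_ercc, ]
```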
@@ -134,57 +141,57 @@ respectively.

We will be using the "Matrix" package to store matrices in sparse-matrix format in R.

```{r}
```{r intro-tab14}
library("Matrix")
cellbarcodes <- read.table("droplet/Kidney-10X_P4_5/barcodes.tsv")
genenames <- read.table("droplet/Kidney-10X_P4_5/genes.tsv")
molecules <- readMM("droplet/Kidney-10X_P4_5/matrix.mtx")
```
Now we will add the appropriate row and column names. However, if you inspect the cell barcodes we just read in, you will see that they are just the barcode sequence associated with each cell. This is a problem because each batch of 10X data uses the same pool of barcodes, so if we need to combine data from multiple 10X batches the cell barcodes will not be unique. Hence we will attach the batch ID to each cell barcode:
```{r}
```{r intro-tab15}
head(cellbarcodes)
```

```{r}
```{r intro-tab16}
rownames(molecules) <- genenames[,1]
colnames(molecules) <- paste("10X_P4_5", cellbarcodes[,1], sep="_")
```
Now let's get the metadata and computational annotations for this data:

```{r}
```{r intro-tab17}
meta <- read.delim("droplet_metadata.csv", sep=",", header = TRUE)
head(meta)
```
Here we can see that we need to use "10X_P4_5" to find the metadata for this batch. Also note that the format of the mouse ID is different in this metadata table, with hyphens instead of underscores and with the gender in the middle of the ID. From checking the methods section of the accompanying paper we know that the same 8 mice were used for both droplet and plate-based techniques, so we need to fix the mouse IDs to be consistent with those used in the FACS experiments.

```{r}
```{r intro-tab18}
meta[meta$channel == "10X_P4_5",]
mouseID <- "3_8_M"
```
Note: depending on the tissue you have been assigned, you may have 10X data from mixed samples, e.g. mouse id = 3-M-5/6. You should still reformat these to be consistent (see the sketch below), but they will not match mouse ids from the FACS data, which may affect your downstream analysis. If the mice weren't from an inbred strain it would be possible to assign individual cells to a specific mouse using exonic-SNPs, but that is beyond the scope of this course.
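
A minimal sketch of such reformatting (the input format and the helper name `reformat_mouse_id` are assumptions based on the examples above):
```{r, eval=FALSE}
# Sketch: convert a metadata-style mouse ID ("3-M-8", "3-M-5/6") into the
# underscore format with the sex at the end ("3_8_M"), as used by the FACS data.
reformat_mouse_id <- function(id) {
    parts <- strsplit(id, "-")[[1]]                  # e.g. c("3", "M", "8")
    paste(parts[1], parts[3], parts[2], sep = "_")   # -> "3_8_M"
}
reformat_mouse_id("3-M-8")    # "3_8_M"
reformat_mouse_id("3-M-5/6")  # "3_5/6_M"
```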

```{r}
```{r intro-tab19}
ann <- read.delim("droplet_annotation.csv", sep=",", header=TRUE)
head(ann)
```
Again you will find a slight formatting difference between the cellID in the annotation and the cellbarcodes, which we will have to correct before matching them.

```{r}
```{r intro-tab20}
ann[,1] <- paste(ann[,1], "-1", sep="")
ann_subset <- ann[match(colnames(molecules), ann[,1]),]
celltype <- ann_subset[,3]
```

Now let's build the cell-metadata dataframe:
```{r}
```{r intro-tab21}
cell_anns <- data.frame(mouse = rep(mouseID, times=ncol(molecules)), type=celltype)
rownames(cell_anns) <- colnames(molecules)
```

__Exercise__ Repeat the above for the other 10X batches for your tissue.

__Answer__
```{r, echo=FALSE, eval=TRUE}
```{r intro-tab22, echo=FALSE, eval=TRUE}
molecules1 <- molecules
cell_anns1 <- cell_anns
@@ -221,21 +228,21 @@ cell_anns3 <- cell_anns

Now that we have read in the 10X data from multiple batches, we need to combine them into a single SingleCellExperiment object. First we will check that the gene names are the same and in the same order across all batches:

```{r}
```{r intro-tab23}
identical(rownames(molecules1), rownames(molecules2))
identical(rownames(molecules1), rownames(molecules3))
```

Now we'll check that there aren't any repeated cellIDs:
```{r}
```{r intro-tab24}
sum(colnames(molecules1) %in% colnames(molecules2))
sum(colnames(molecules1) %in% colnames(molecules3))
sum(colnames(molecules2) %in% colnames(molecules3))
```

Everything is ok, so we can go ahead and combine them:

```{r}
```{r intro-tab25}
all_molecules <- cbind(molecules1, molecules2, molecules3)
all_cell_anns <- as.data.frame(rbind(cell_anns1, cell_anns2, cell_anns3))
all_cell_anns$batch <- rep(c("10X_P4_5", "10X_P4_6","10X_P7_5"), times = c(nrow(cell_anns1), nrow(cell_anns2), nrow(cell_anns3)))
@@ -245,13 +252,13 @@ __Exercise__
How many cells are in the whole dataset?

__Answer__
```{r, echo=FALSE, eval=FALSE}
```{r intro-tab26, echo=FALSE, eval=FALSE}
dim(all_molecules)[2]
```

Now build the SingleCellExperiment object. One of the advantages of the SingleCellExperiment class is that it is capable of storing data in normal matrix or sparse-matrix format, as well as in HDF5 format, which allows large non-sparse matrices to be stored & accessed on disk in an efficient manner rather than loading the whole thing into RAM.

```{r}
```{r intro-tab27}
all_molecules <- as.matrix(all_molecules)
sceset <- SingleCellExperiment(
assays = list(counts = as.matrix(all_molecules)),
@@ -260,7 +267,7 @@ sceset <- SingleCellExperiment(
```
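
As an illustration of the HDF5 option, here is a hedged sketch (not part of the course build; it assumes the Bioconductor `HDF5Array` package is installed):
```{r, eval=FALSE}
# Sketch: keep the counts on disk in HDF5 format instead of in RAM.
library(HDF5Array)
h5_counts <- writeHDF5Array(all_molecules)  # written to an automatic .h5 dump file
sceset_h5 <- SingleCellExperiment(
    assays = list(counts = h5_counts),
    colData = all_cell_anns
)
```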

Since this is 10X data it will not contain spike-ins, so we just save the data:
```{r}
```{r intro-tab28}
saveRDS(sceset, "kidney_droplet.rds")
```
