This pipeline transforms raw NGS sequencing reads from xDbit experiments into count matrices.
Input formats: `.fastq` or `.fastq.gz`
Output format: `.txt.gz`
- Installation of pipeline
- Test run
- Preparations
- Run ReadsToCounts pipeline
- Installation of external tools
`ReadsToCounts` was tested on a Linux machine, and the genome alignment steps using STAR require >30 GB RAM.
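To check whether the machine provides enough memory, one can run, for example:
# show available memory in human-readable units
free -h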
Installation of external tools is described below.
# create environment from file
conda env create -f ./ReadsToCounts.yml
# activate environment
conda activate ReadsToCounts
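# optional: verify that the environment was created
conda env list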
To test whether the installation worked and the ReadsToCounts pipeline runs on your device, it can be run without the alignment steps as follows. Settings for the test run are read from `./batch_parameters.csv`.
# test run
conda activate ReadsToCounts
nohup python xDbit_run_batch.py --skip_align --batch_file batch_parameters.csv &
# OR
nohup python xDbit_run_batch.py --skip_align --batch_file batch_parameters.xlsx &
The test run creates a `nohup.out` file in the current directory and, for each batch specified in `./batch_parameters.csv`, a log file `./test_files/pipeline_batch{n}_{date}.out`. To check the log files one can use `cat`, or `tail -f` to follow the addition of output in real time.
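For example, to follow the log of the first batch (the batch number and date suffix below are placeholders):
# follow the log file of batch 1 in real time
tail -f ./test_files/pipeline_batch1_2022-01-01.out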
If the log file does not show any errors and the pipeline runs to the end, all necessary packages besides the STAR alignment software are installed and all files, except for the STAR alignment files, are correctly formatted.
It is important to use the `--no-lane-splitting` option here.
nohup bcl2fastq --runfolder-dir 220111_A00623_0449_BHN3G2DRXY/ --sample-sheet 220111_A00623_0449_BHN3G2DRXY/SampleSheet_HN3G2DRXY.csv --no-lane-splitting --minimum-trimmed-read-length=8 --mask-short-adapter-reads=8 --ignore-missing-positions --ignore-missing-controls --ignore-missing-filter --ignore-missing-bcls -r 20 -w 20 -p 40 &> nohup_demult.out &
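By default, bcl2fastq writes the demultiplexed files into `Data/Intensities/BaseCalls/` inside the run folder. With `--no-lane-splitting` there should be one R1/R2 fastq pair per sample, without `L001`-`L004` lane suffixes in the file names:
# check the demultiplexed output
ls 220111_A00623_0449_BHN3G2DRXY/Data/Intensities/BaseCalls/*.fastq.gz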
Documentation: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
cd /path/to/fastqs
# run FastQC in background
nohup fastqc -t 40 <fastq_folders>/*fastq.gz &> nohup_fastqc.out &
Documentation: https://multiqc.info/
# run multiqc after FastQC analysis
multiqc <fastq_folders>/
Example for human genome:
# download and unzip reference genome and annotation file from ensembl
wget ftp://ftp.ensembl.org/pub/release-100/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
wget ftp://ftp.ensembl.org/pub/release-100/gtf/homo_sapiens/Homo_sapiens.GRCh38.100.gtf.gz
gzip -d Homo_sapiens.GRCh38.100.gtf.gz
gzip -d Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# run metadata script
./path/to/xDbit_toolbox/external_tools/Drop-seq_tools-2.1.0/create_Drop-seq_reference_metadata.sh \
-s Homo_sapiens -n Hs_metadata -r Homo_sapiens.GRCh38.dna.primary_assembly.fa -g Homo_sapiens.GRCh38.100.gtf \
-d ./path/to/xDbit_toolbox/external_tools/Drop-seq_tools-2.1.0
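In case the STAR genome index for the `STAR_genome` parameter (see below) still needs to be built from these files, a typical `STAR --runMode genomeGenerate` call looks roughly as follows (a sketch; `--sjdbOverhang` should be set to read length minus 1, the value 99 assumes 100 bp reads):
# build a STAR genome index from the unzipped fasta and GTF files
mkdir -p STAR_genome
STAR --runMode genomeGenerate \
     --runThreadN 20 \
     --genomeDir STAR_genome \
     --genomeFastaFiles Homo_sapiens.GRCh38.dna.primary_assembly.fa \
     --sjdbGTFfile Homo_sapiens.GRCh38.100.gtf \
     --sjdbOverhang 99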
The xDbit sequencing reads have the following structure:
Each barcode on the read lies at a predetermined position relative to the beginning of the read and corresponds to a specific coordinate. In order to assign each read a specific X-, Y-, and, if used, Z-coordinate, the pipeline requires information about the barcode-to-coordinate assignments.
Two types of files are used to provide this information:
- Well-to-coordinate information for each of the nine xDbit samples in `.csv` format
- Barcode-to-well information in the `barcode_legend_empty.csv` file
Those two files are used to fill the empty barcode legend file with all information required by the pipeline.
The following section gives a more detailed overview of the file structure discussed above.
Examples of the files can be found under ./barcodes/well_coord_assignments/
Each of the nine files has the following structure:
X_Coord | X_WellPosition | Y_Coord | Y_WellPosition | Z_Coord | Z_WellPosition |
---|---|---|---|---|---|
37 | 0 | 0 | A1 | | |
36 | B3 | 1 | B1 | 0 | C4 |
35 | C3 | 2 | C1 | 0 | D6 |
34 | D3 | 3 | D1 | 0 | F6 |
33 | A4 | 4 | A2 | | |
… | … | … | … | … | … |
- `X_Coord`: X-coordinate/column in the xDbit grid.
- `X_WellPosition`: Well position in the barcoding well plate of the respective ligation round. Needs to correspond to the labels in the barcode legend file.
- `Y_Coord`: Y-coordinate/row in the xDbit grid.
- `Y_WellPosition`: Well position in the barcoding well plate of the respective ligation round. Needs to correspond to the labels in the barcode legend file.
- `Z_Coord`: Sample ID.
- `Z_WellPosition`: Well position of the barcodes used for the RevT-indexing of this sample.
An example can be found in `./barcodes/barcodes_legend_empty.csv`
Structure of the file:
Row | Column | WellPosition | Name | Barcode | X | Y | Z | string_matching_algorithm | X_maxdist | Y_maxdist | Z_maxdist |
---|---|---|---|---|---|---|---|---|---|---|---|
A | 1 | A1 | Round1_01 | AACGTGAT | | | | levenshtein | 1 | 1 | 3 |
B | 1 | B1 | Round1_02 | AAACATCG | | | | | | | |
C | 1 | C1 | Round1_03 | ATGCCTAA | | | | | | | |
D | 1 | D1 | Round1_04 | AGTGGTCA | | | | | | | |
E | 1 | E1 | Round1_05 | ACCACTGT | | | | | | | |
... | ... | ... | ... | ... | | | | | | | |
- `WellPosition`: Well position in the barcoding well plates. Assumes that both ligation rounds use the same set of barcodes at the same positions in the well plates.
- `Barcode`: Barcode in this well position.
- `X/Y/Z`: Filled by the `./src/fill_barcode_legend.py` script.
- `string_matching_algorithm`: Method used to calculate the distance between read sequence and barcodes. Allows `hamming` or `levenshtein`.
- `X/Y/Z_maxdist`: Maximum distance between read and barcode that is still allowed for the pair to be considered a match.
In this step a script is used to merge the well-to-coordinate and well-to-barcode information into the barcode-to-coordinate information required by the pipeline. The script uses all `.csv` files it finds under `path/to/well_coord_assignments/`.
# fill the empty barcodes_legend
python ./src/fill_barcode_legend.py path/to/barcodes_legend_empty.csv path/to/well_coord_assignments/
The output is saved into the input folder.
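To locate the filled legend, one can for example list the folder contents by modification time (assuming here that the well-coordinate folder is the one written to):
# the filled legend should appear as the newest file
ls -lt path/to/well_coord_assignments/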
The pipeline can process multiple datasets in parallel. All required information is summarized in a batch parameter file.
An example of the batch parameter file is provided in `./batch_parameters.csv`. Each row of the spreadsheet represents one sample.
Description of all mandatory parameters (a minimal example file is sketched after this list):
- `batch`: Number of the batch in which this sample should be processed. Allows integers starting from 1. How many samples can be run in one batch depends mainly on the available RAM.
- `fastq_R1`/`fastq_R2`: Path to the fastq file of read 1 or read 2, respectively.
- `legend`: Path to the filled barcode legend file.
- `STAR_genome`: Path to the directory with the STAR genome.
- `STAR_fasta`: Path to the STAR metadata `.fasta`.
- `expected_n_spots`: Expected number of spots. This is important to accelerate the calculation of certain QC outputs during analysis.
- `mode`: Allows `xDbit` and `Dbit-seq`.
  - `xDbit`: Pipeline takes the Z barcode into account.
  - `Dbit-seq`: Pipeline ignores the Z barcode.
- `pipeline_dir`: Path to the `xDbit_pipeline.sh` file.
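For illustration, a single-sample batch parameter file could look as follows (all paths and the spot number are hypothetical placeholders; compare `./batch_parameters.csv` for the exact column names and order):
batch,fastq_R1,fastq_R2,legend,STAR_genome,STAR_fasta,expected_n_spots,mode,pipeline_dir
1,path/to/R1.fastq.gz,path/to/R2.fastq.gz,path/to/barcodes_legend.csv,path/to/STAR_genome/,path/to/Hs_metadata.fasta,2000,xDbit,path/to/xDbit_pipeline.sh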
# activate environment
conda activate ReadsToCounts
# go to working directory
cd path/to/working-directory
# run batch process in background
nohup python path/to/xDbit_run_batch.py --batch_file path/to/batch_parameter.csv & # output is printed to nohup.out
All output and temporary files are saved into the folders `out/` and `tmp/` in the base directory of the `fastq_R1` path that was provided in the batch parameters file. Output generated during the pipeline is written into the log file `pipeline_{}.out` in the same folder.
conda install -c bioconda fastqc
pip install multiqc
Download and instructions from: https://www.htslib.org/download/
# download samtools
wget https://github.com/samtools/samtools/releases/download/1.12/samtools-1.12.tar.bz2
# unzip
tar -xf samtools-1.12.tar.bz2
cd samtools-1.12 # and similarly for bcftools and htslib
./configure --without-curses --prefix=/where/to/install
make
make install
# open .bashrc file using a text editor (here nano)
nano ~/.bashrc
# add following command at the end of .bashrc file
export PATH=/where/to/install/bin:$PATH
# save with CTRL+O and exit with CTRL+X
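# reload .bashrc so the new PATH takes effect in the current shell
source ~/.bashrc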
Use samtools, for example, with the following command to show the first lines of a `.bam` file:
samtools view file.bam | head
# install or upgrade cutadapt
conda activate ReadsToCounts
python3 -m pip install --user --upgrade cutadapt
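# check that cutadapt is installed and print its version
cutadapt --version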
Documentation: https://github.com/alexdobin/STAR
# Get latest STAR source from releases
wget https://github.com/alexdobin/STAR/archive/2.7.8a.tar.gz
tar -xzf 2.7.8a.tar.gz
cd STAR-2.7.8a
# Compile
cd source
make STAR
# Recommended: Add STAR to path variables
nano ~/.bashrc
# add following command at the end of .bashrc file
export PATH=$PATH:/path/to/software/STAR/STAR-2.7.Xa/bin/Linux_x86_64/
# On the HPC server it is installed and accessible for the Meier lab using the following path variable:
# export PATH=$PATH:/home/hpc/meier/software/STAR/STAR-2.7.4a/bin/Linux_x86_64/
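# verify that STAR is found
STAR --version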
This toolbox is based on the Drop-seq toolbox 2.1.0, which is included in this repository and therefore downloaded when cloning it.
Alternatively, it can be downloaded from here: https://github.com/broadinstitute/Drop-seq/releases/tag/v2.1.0
mkdir test_files
# create test files with the first 100,000 reads (400,000 fastq lines) of read 1 and read 2
gzip -d -c 210319_NB552024_0035_AHCJ33AFX2/Data/Intensities/BaseCalls/37_28_Dbitseq_S1_R1_001.fastq.gz | head -n 400000 > test_files/r1_100k.fastq
gzip -d -c 210319_NB552024_0035_AHCJ33AFX2/Data/Intensities/BaseCalls/37_28_Dbitseq_S1_R2_001.fastq.gz | head -n 400000 > test_files/r2_100k.fastq
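# sanity check: each test file should contain 100,000 reads (4 fastq lines per read)
echo $(( $(wc -l < test_files/r1_100k.fastq) / 4 ))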
I thought about adapter trimming again, did some research, and came up with the following conclusions:
- From https://thesequencingcenter.com/wp-content/uploads/2019/04/illumina-adapter-sequences.pdf:
  - "When read length exceeds DNA insert size, a run can sequence beyond the DNA insert and read bases from the sequencing adapter. To prevent these bases from appearing in FASTQ files, the adapter sequence is trimmed from the 3' ends of reads. Trimming the adapter sequence improves alignment accuracy and performance in Illumina FASTQ generation pipelines."
  - "if the sequencing extends beyond the length of the DNA insert, and into the adapter on the opposite end of the library fragment, that adapter sequence will be found on the 3' end of the read. Therefore, reads require adapter trimming only on their 3' ends."
- Conclusion: only trimming on 3' ends is necessary.
In the case of Split-seq/Dbit-seq, read 1 starts with the cDNA sequence and ends with the poly(A) sequence followed by the barcodes etc. Therefore, it is necessary to do poly(A) trimming (as done in the Split-seq pipeline), and adapter trimming would not add any additional value. Read 2 starts with the UMI and continues with the barcodes etc. From this read we extract only the barcodes and do not need it for any alignment, so even if there were adapters at its end, they would not disturb the analysis.
- No adapter trimming necessary.
- Poly(A) trimming is enough.
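For reference, poly(A) trimming alone could be done with cutadapt roughly as follows (an illustrative sketch, not the pipeline's exact call; `A{10}` is expanded by cutadapt to a poly(A) adapter of length 10):
# illustrative: trim poly(A) stretches from the 3' end of read 1
cutadapt -a "A{10}" -o R1_polyA_trimmed.fastq.gz R1.fastq.gz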
However, since FastQC for 37_28 reported 5-10 % adapter content for the "Illumina Universal Adapter" and the "Nextera Transposase Sequence", I added adapter trimming to the cutadapt filtering step:
cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A CTGTCTCTTATACACATCTGACGCTGCCGACGA --minimum-length ${min_lengths} -j 4 \
-o ${outdir}/R1_filtered.fastq.gz -p ${outdir}/R2_filtered.fastq.gz ${r1} ${r2} | tee ${tmpdir}/cutadapt_filter.out