micromamba env create -n basecalling-cuda117 -f envs/basecalling_cuda11.7_pytorch2.yml
micromamba activate basecalling-cuda117
Use conda/miniconda/mamba/micromamba.
for testing with small datasets
python feito/train.py --path-train data/subsample_train.hdf5 --path-val data/subsample_val.hdf5 --model Rodan --epochs 5 --batch-size 16
python3 feito/train.py --path-train data/RODAN/train/rna-train.hdf5 --path-val data/RODAN/train/rna-valid.hdf5 --epochs 30 --batch-size 16 --num-workers 4 --model SimpleNet --device cuda
with RODAN's dataset
python feito/train.py --path-train data/RODAN/train/rna-train.hdf5 --path-val data/RODAN/train/rna-valid.hdf5 --model Rodan --epochs 20 --batch-size 64 --device cuda
- This test assumes that the testing dataset is in the same format as the training and validation sets (`hdf5` format), i.e. you have split reads with their ground truths (the snippet after this list shows how to inspect the layout).
- For experimental purposes, use `/extdata/RODAN/train/rna-test.hdf5`.
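If you are unsure whether a file matches that layout, a quick way to check is to walk the HDF5 hierarchy with `h5py` and print every dataset's name and shape (a minimal sketch; the file path is just an example):

```python
import h5py

def show(name, obj):
    # Print every dataset's name, shape, and dtype.
    if isinstance(obj, h5py.Dataset):
        print(name, obj.shape, obj.dtype)

with h5py.File("data/subsample_train.hdf5", "r") as f:
    f.visititems(show)
```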
RODAN with small dataset
python feito/test.py --path-test data/subsample_val.hdf5 --batch-size 16 --model Rodan --device cpu --path-checkpoint output/training/checkpoints/Rodan-epoch5.pt --path-fasta output/test/basecalled_signals.fa --rna true --use-viterbi true
SimpleNet with small dataset
python feito/test.py --path-test data/subsample_val.hdf5 --batch-size 16 --model SimpleNet --device cpu --path-checkpoint output/training/checkpoints/SimpleNet-epoch1.pt --path-fasta output/test/basecalled_signals_SimpleNet.fa --rna true --use-viterbi true
- This assumes you have a trained model and a set of reads in `fast5` format.
- Reads will be split by the dataloader into non-overlapping signals whose length equals the model's input size (this must currently be provided as a parameter, but it shouldn't be (FIXME:)), and an index will be created to map each portion of the basecalled signal back to its portion of the read.
python feito/basecall.py --path-fast5 data/RODAN/test/mouse-dataset/0 --len-subsignals 4096 --path-index output/basecalling/simplenet-index.csv --batch-size 16 --model SimpleNet --device cpu --path-checkpoint output/training/checkpoints/SimpleNet-epoch30.pt --path-fasta output/basecalling/simplenet-basecalled_reads.fa --path-reads output/basecalling/simplenet-basecalled_reads.fa
Since raw signals need to be split into chunks of a fixed length, each read is basecalled as several separate portions. For this reason, an index of the basecalled portions is built during the previous step. Now we need to take those portions plus the index and reconstruct each read by concatenating the portions in the right order.
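A minimal sketch of that reconstruction, assuming the index CSV maps each FASTA record to its read id and position within the read (the column names `read_id`, `pos`, and `record_id` are illustrative, not the actual schema of `simplenet-index.csv`):

```python
import csv
from collections import defaultdict

# Hypothetical index schema: one row per basecalled portion.
index = defaultdict(list)  # read_id -> [(pos, record_id), ...]
with open("output/basecalling/simplenet-index.csv") as f:
    for row in csv.DictReader(f):
        index[row["read_id"]].append((int(row["pos"]), row["record_id"]))

# Load the basecalled portions (a simple FASTA parser).
portions = {}
with open("output/basecalling/simplenet-basecalled_reads.fa") as f:
    record_id = None
    for line in f:
        line = line.strip()
        if line.startswith(">"):
            record_id = line[1:].split()[0]
            portions[record_id] = []
        elif record_id is not None:
            portions[record_id].append(line)

# Concatenate the portions of each read in the right order.
reads = {
    read_id: "".join("".join(portions[rid]) for _, rid in sorted(chunks))
    for read_id, chunks in index.items()
}
```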
Install minimap2 in a conda environment
micromamba env create -n map-reads -f envs/minimap2.yml
micromamba activate map-reads
map reads to transcriptome
transcriptome="/projects5/basecalling-jorge/basecalling/data/RODAN/test/transcriptomes/mouse_reference.fasta"
reads="/projects5/basecalling-jorge/basecalling/output-old/basecalling/simplenet-basecalled_reads.fa"
samfile="output-old/basecalling/mapped_reads.sam"
minimap2 --secondary=no -ax map-ont -t 32 --cs $transcriptome $reads > $samfile
sort mapped reads
bamfile="output-old/basecalling/mapped_reads.bam"
samtools view -bS $samfile | samtools sort > $bamfile
indexing
samtools index $bamfile
visualize alignment
samtools view $bamfile | less -S
check statistics of mapped/unmapped reads
samtools flagstat $bamfile
- Callbacks:
  - Checkpoint: save best model
  - Early stopping
- Test model: compute accuracy of basecalled reads
  - Use Viterbi (and/or beam search) to generate reads from the model output
  - Align basecalled reads against the ground truth with Smith-Waterman
- Create our own datasets from raw signals and a reference
- New architecture for RNA; consider the sampling rate
Basecalling
To map the output of the model to an RNA sequence, use beam search to decode the output of the neural network: https://github.com/nanoporetech/fast-ctc-decode
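A minimal sketch of the decoding step, assuming the model emits per-timestep probabilities over the RNA alphabet plus a blank symbol (the alphabet string, shapes, and downsampling factor are assumptions, not values taken from `feito`):

```python
import numpy as np
from fast_ctc_decode import beam_search, viterbi_search

# Assumed output: (timesteps, len(alphabet)) probabilities, blank ("N") first.
alphabet = "NACGU"
posteriors = np.random.rand(4096 // 8, len(alphabet)).astype(np.float32)
posteriors /= posteriors.sum(axis=1, keepdims=True)  # normalize each row

seq, path = beam_search(posteriors, alphabet, beam_size=5, beam_cut_threshold=0.1)
# viterbi_search(posteriors, alphabet) is the greedy alternative.
```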
Computation of accuracy
To compare the basecalled read against the ground truth read, use Smith-Waterman (local alignment)
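One way to get an accuracy number from that comparison (a sketch using Biopython's `PairwiseAligner` in local mode; the scoring values and the accuracy formula are assumptions, and `parasail` or `edlib` would work equally well):

```python
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "local"  # Smith-Waterman
aligner.match_score = 1
aligner.mismatch_score = -1
aligner.open_gap_score = -1
aligner.extend_gap_score = -1

basecalled = "ACGUACGUAC"
ground_truth = "ACGAACGUAC"

alignment = aligner.align(basecalled, ground_truth)[0]
# Crude accuracy: alignment score over ground-truth length.
accuracy = alignment.score / len(ground_truth)
print(alignment)
print(f"accuracy ~ {accuracy:.2f}")
```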
Connect to a GPU in the server
qrsh -l gpu_mem=8G
Steps to basecall ONT signals
- Generate a dataset for training a supervised model
- Split the raw signals into chunks of a fixed size (RODAN uses 4096-long signals); a sketch follows this list
- Basecall the raw signals to obtain a ground truth (with the RODAN basecaller)
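A minimal sketch of the chunking step, assuming a 1-D numpy signal and the 4096-sample length used by RODAN (zero-padding the last partial chunk is my choice here, not necessarily what RODAN does):

```python
import numpy as np

def split_signal(signal: np.ndarray, chunk_len: int = 4096) -> np.ndarray:
    """Split a raw signal into non-overlapping fixed-size chunks,
    zero-padding the last chunk if the length is not a multiple."""
    n_chunks = int(np.ceil(len(signal) / chunk_len))
    padded = np.zeros(n_chunks * chunk_len, dtype=signal.dtype)
    padded[: len(signal)] = signal
    return padded.reshape(n_chunks, chunk_len)

chunks = split_signal(np.random.randn(10_000).astype(np.float32))
print(chunks.shape)  # (3, 4096)
```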
How do the sampling rates influence the architectures?
|     | sampling rate [samples/sec] | translocation speed [bp/sec] | [samples/bp] |
|-----|-----------------------------|------------------------------|--------------|
| DNA | 4000                        | 450                          | 8.89         |
| RNA | 3012                        | 70                           | 43.03        |
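The [samples/bp] column is just the sampling rate divided by the translocation speed (e.g. 3012 / 70 ≈ 43.03 for RNA), so RNA signals carry roughly five times more samples per base than DNA signals.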
Paths to RODAN's datasets on the compbio server
/extdata/RODAN/train/rna-train.hdf5
/extdata/RODAN/train/rna-test.hdf5
/extdata/RODAN/test
Directory where I am working
compbio:/projects5/basecalling-jorge/basecalling
basecalling
├── feito: source code
├── envs: yaml files with different environments that can be installed with conda/miniconda/mamba/micromamba (different versions of pytorch and cuda)
├── data: store data here
├── notebooks: jupyter notebooks to test code
├── output: results of training
├── params.yml: input parameters for training with DVC
└── README.md
Source code
feito
├── api: trainer, tester, basecaller APIs
├── callbacks: functions to be run after each epoch in the training
├── dataloaders: classes to be used with DataLoader from pytorch
├── loss_functions: variants of CTCLoss
├── models: architectures and custom layers
├── utils: accuracy and others
├── feito.py: custom pipeline to basecall reads from fast5 files
├── trainer.py: custom pipeline to train a basecaller
└── tester.py: custom pipeline to test a basecaller