micromamba env create -n basecalling-cuda117 -f envs/basecalling_cuda11.7_pytorch2.yml
micromamba activate basecalling-cuda117
Use conda/miniconda/mamba/micromamba.
for testing with small datasets
python feito/train.py --path-train data/subsample_train.hdf5 --path-val data/subsample_val.hdf5 --model Rodan --epochs 5 --batch-size 16
python3 feito/train.py --path-train data/RODAN/train/rna-train.hdf5 --path-val data/RODAN/train/rna-valid.hdf5 --epochs 30 --batch-size 16 --num-workers 4 --model SimpleNet --device cuda
with RODAN's dataset
python feito/train.py --path-train data/RODAN/train/rna-train.hdf5 --path-val data/RODAN/train/rna-valid.hdf5 --model Rodan --epochs 20 --batch-size 64 --device cuda
- This test assumes that the testing dataset is in the same format as the training and validation sets (`hdf5` format), i.e. you have split reads with their ground truths (the snippet after this list shows how to inspect the layout).
- For experimental purposes, use `/extdata/RODAN/train/rna-test.hdf5`.
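If you are unsure whether a file matches that layout, a quick way to check is to walk the HDF5 hierarchy with `h5py` and print every dataset's name and shape (a minimal sketch; the file path is just an example):

```python
import h5py

def show(name, obj):
    # Print every dataset's name, shape, and dtype.
    if isinstance(obj, h5py.Dataset):
        print(name, obj.shape, obj.dtype)

with h5py.File("data/subsample_train.hdf5", "r") as f:
    f.visititems(show)
```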
RODAN with small dataset
python feito/test.py --path-test data/subsample_val.hdf5 --batch-size 16 --model Rodan --device cpu --path-checkpoint output/training/checkpoints/Rodan-epoch5.pt --path-fasta output/test/basecalled_signals.fa --rna true --use-viterbi true
SimpleNet with small dataset
python feito/test.py --path-test data/subsample_val.hdf5 --batch-size 16 --model SimpleNet --device cpu --path-checkpoint output/training/checkpoints/SimpleNet-epoch1.pt --path-fasta output/test/basecalled_signals_SimpleNet.fa --rna true --use-viterbi true
- This assumes you have a trained model and a set of reads in `fast5` format.
- Reads will be split by the dataloader into non-overlapping signals whose length equals the model's input size (this must currently be provided as a parameter, but it shouldn't be (FIXME:)), and an index will be created to map each portion of the basecalled signal back to its portion of the read.
python feito/basecall.py --path-fast5 data/RODAN/test/mouse-dataset/0 --len-subsignals 4096 --path-index output/basecalling/simplenet-index.csv --batch-size 16 --model SimpleNet --device cpu --path-checkpoint output/training/checkpoints/SimpleNet-epoch30.pt --path-fasta output/basecalling/simplenet-basecalled_reads.fa --path-reads output/basecalling/simplenet-basecalled_reads.fa
Since raw signals need to be split into chunks of a fixed length, each read is basecalled as several separate portions. For this reason, an index of the basecalled portions is built during the previous step. Now we need to take those portions plus the index and reconstruct each read by concatenating the portions in the right order.
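A minimal sketch of that reconstruction, assuming the index CSV maps each FASTA record to its read id and position within the read (the column names `read_id`, `pos`, and `record_id` are illustrative, not the actual schema of `simplenet-index.csv`):

```python
import csv
from collections import defaultdict

# Hypothetical index schema: one row per basecalled portion.
index = defaultdict(list)  # read_id -> [(pos, record_id), ...]
with open("output/basecalling/simplenet-index.csv") as f:
    for row in csv.DictReader(f):
        index[row["read_id"]].append((int(row["pos"]), row["record_id"]))

# Load the basecalled portions (a simple FASTA parser).
portions = {}
with open("output/basecalling/simplenet-basecalled_reads.fa") as f:
    record_id = None
    for line in f:
        line = line.strip()
        if line.startswith(">"):
            record_id = line[1:].split()[0]
            portions[record_id] = []
        elif record_id is not None:
            portions[record_id].append(line)

# Concatenate the portions of each read in the right order.
reads = {
    read_id: "".join("".join(portions[rid]) for _, rid in sorted(chunks))
    for read_id, chunks in index.items()
}
```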
Install minimap2 in a conda environment
micromamba env create -n map-reads -f envs/minimap2.yml
micromamba activate map-reads
map reads to transcriptome
transcriptome="/projects5/basecalling-jorge/basecalling/data/RODAN/test/transcriptomes/mouse_reference.fasta"
reads="/projects5/basecalling-jorge/basecalling/output-old/basecalling/simplenet-basecalled_reads.fa"
samfile="output-old/basecalling/mapped_reads.sam"
minimap2 --secondary=no -ax map-ont -t 32 --cs $transcriptome $reads > $samfile
sort mapped reads
bamfile="output-old/basecalling/mapped_reads.bam"
samtools view -bS $samfile | samtools sort > $bamfile
indexing
samtools index $bamfile
visualize alignment
samtools view $bamfile | less -S
check statistics of mapped/unmapped reads
samtools flagstat $bamfile
- Callbacks:
  - Checkpoint: save best model
  - Early stopping
- Test model: compute accuracy of basecalled reads
  - Use Viterbi (and/or beam search) to generate reads from the model output
  - Align basecalled reads against the ground truth with Smith-Waterman
- Create our own datasets from raw signals and a reference
- New architecture for RNA; consider the sampling rate
Basecalling
To map the output of the model to an RNA sequence, use beam search to decode the output of the neural network: https://github.com/nanoporetech/fast-ctc-decode
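A minimal sketch of the decoding step, assuming the model emits per-timestep probabilities over the RNA alphabet plus a blank symbol (the alphabet string, shapes, and downsampling factor are assumptions, not values taken from `feito`):

```python
import numpy as np
from fast_ctc_decode import beam_search, viterbi_search

# Assumed output: (timesteps, len(alphabet)) probabilities, blank ("N") first.
alphabet = "NACGU"
posteriors = np.random.rand(4096 // 8, len(alphabet)).astype(np.float32)
posteriors /= posteriors.sum(axis=1, keepdims=True)  # normalize each row

seq, path = beam_search(posteriors, alphabet, beam_size=5, beam_cut_threshold=0.1)
# viterbi_search(posteriors, alphabet) is the greedy alternative.
```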
Computation of accuracy
To compare the basecalled read against the ground truth read, use Smith-Waterman (local alignment)
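One way to get an accuracy number from that comparison (a sketch using Biopython's `PairwiseAligner` in local mode; the scoring values and the accuracy formula are assumptions, and `parasail` or `edlib` would work equally well):

```python
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "local"  # Smith-Waterman
aligner.match_score = 1
aligner.mismatch_score = -1
aligner.open_gap_score = -1
aligner.extend_gap_score = -1

basecalled = "ACGUACGUAC"
ground_truth = "ACGAACGUAC"

alignment = aligner.align(basecalled, ground_truth)[0]
# Crude accuracy: alignment score over ground-truth length.
accuracy = alignment.score / len(ground_truth)
print(alignment)
print(f"accuracy ~ {accuracy:.2f}")
```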
Connect to a GPU in the server
qrsh -l gpu_mem=8G
Steps to basecall ONT signals
- Generate a dataset for training a supervised model
- Split the raw signals into chunks of a fixed size (RODAN uses 4096-long signals); a sketch follows this list
- Basecall the raw signals to obtain a ground truth (with the RODAN basecaller)
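A minimal sketch of the chunking step, assuming a 1-D numpy signal and the 4096-sample length used by RODAN (zero-padding the last partial chunk is my choice here, not necessarily what RODAN does):

```python
import numpy as np

def split_signal(signal: np.ndarray, chunk_len: int = 4096) -> np.ndarray:
    """Split a raw signal into non-overlapping fixed-size chunks,
    zero-padding the last chunk if the length is not a multiple."""
    n_chunks = int(np.ceil(len(signal) / chunk_len))
    padded = np.zeros(n_chunks * chunk_len, dtype=signal.dtype)
    padded[: len(signal)] = signal
    return padded.reshape(n_chunks, chunk_len)

chunks = split_signal(np.random.randn(10_000).astype(np.float32))
print(chunks.shape)  # (3, 4096)
```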
How do the sampling rates influence the architectures?
|     | sampling rate [samples/sec] | translocation speed [bp/sec] | [samples/bp] |
|-----|-----------------------------|------------------------------|--------------|
| DNA | 4000                        | 450                          | 8.89         |
| RNA | 3012                        | 70                           | 43.03        |
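The [samples/bp] column is just the sampling rate divided by the translocation speed (e.g. 3012 / 70 ≈ 43.03 for RNA), so RNA signals carry roughly five times more samples per base than DNA signals.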
Paths to RODAN's datasets on the compbio server
/extdata/RODAN/train/rna-train.hdf5
/extdata/RODAN/train/rna-test.hdf5
/extdata/RODAN/test
Directory where I am working
compbio:/projects5/basecalling-jorge/basecalling
basecalling
├── feito: source code
├── envs: yaml files with different environments that can be installed with conda/miniconda/mamba/micromamba (different versions of pytorch and cuda)
├── data: store data here
├── notebooks: jupyter notebooks to test code
├── output: results of training
├── params.yml: input parameters for training with DVC
└── README.md
Source code
feito
├── api: trainer, tester, basecaller APIs
├── callbacks: functions to be run after each epoch in the training
├── dataloaders: classes to be used with DataLoader from pytorch
├── loss_functions: variants of CTCLoss
├── models: architectures and custom layers
├── utils: accuracy and others
├── feito.py: custom pipeline to basecall reads from fast5 files
├── trainer.py: custom pipeline to train a basecaller
└── tester.py: custom pipeline to test a basecaller