Classification of SARS-CoV-2 sequences using FCGR and CNN

Frequency Chaos Game Representation with Deep Learning

Try the trained models

A web app is available with all the trained models; you only need to upload a FASTA file with your sequences.

Data

  • Sequences and metadata must be downloaded from GISAID after creating an account and accepting the Terms of Use.
  • The reference sequence can be downloaded from here.
  • The list of variant markers for each clade is saved in mutations_reference.json and can be found here.

Before running the Snakefile, add the paths to the sequences, metadata and reference genome to parameters.yaml:

PATH_FASTA_GISAID: "path/to/sequences.fasta"
PATH_METADATA: "path/to/metadata.tsv"
PATH_REFERENCE_GENOME: "path/to/reference.fasta"
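
Since the pipeline depends on these three paths, it can help to check them before launching Snakemake. The following is a minimal sketch, not part of the repository: the helper name `missing_inputs` is hypothetical, and it assumes the parameters have already been loaded into a plain dictionary.

```python
from pathlib import Path

# Keys taken from the parameters.yaml excerpt above.
REQUIRED_KEYS = ("PATH_FASTA_GISAID", "PATH_METADATA", "PATH_REFERENCE_GENOME")


def missing_inputs(params):
    """Return the required keys that are absent or do not point to an existing file.

    `params` is assumed to be a dict with the contents of parameters.yaml.
    This helper is hypothetical and is not provided by the repository.
    """
    problems = []
    for key in REQUIRED_KEYS:
        value = params.get(key)
        if not value or not Path(value).is_file():
            problems.append(key)
    return problems
```

One way to obtain `params` would be to parse parameters.yaml with a YAML library such as PyYAML (`yaml.safe_load`), which is an assumption about your environment. An empty list returned by `missing_inputs(params)` means all three input files were found.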

Create a virtual environment and install packages

python -m venv env
source env/bin/activate
pip install -r requirements.txt

Set parameters for the experiment in parameters.yaml

  • See (and include) the preprocessing functions in preprocessing.py

Run

snakemake -p -c1

To visualize the DAG of rules (the dot command requires Graphviz):

snakemake --forceall --dag | dot -Tpdf > dag.pdf

The Snakefile runs the scripts in this order:

  1. undersample_sequences.py
  2. extract_sequences.py (extracts each undersampled sequence into its own FASTA file)
  3. fasta2fcgr (generates an npy file with the $k$-th order FCGR for each sequence extracted in the previous step)
  4. split_data.py (creates a file datasets.json with the train, validation and test sets)
  5. train.py (trains the model for the selected $k$-mer size)
  6. test.py
  7. classification_metrics.py (computes accuracy, precision, recall and F1-score)
  8. clustering_metrics.py (computes the Silhouette score, Calinski-Harabasz score and Generalized Discrimination Value on the test set)
  9. plots.py (generates accuracy and loss plots for the training and validation sets, and a confusion matrix for the test set)
  10. saliency_map.py and shap_values.py (feature importance methods)
  11. svm_experiment.py (trains an SVM using subsets of relevant $k$-mers chosen by the feature importance methods)
  12. match_relevant_kmers.py (matches the relevant $k$-mers chosen by the feature importance methods to the list of variant markers for each clade)
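
To make step 3 more concrete, here is a minimal pure-Python sketch of how a $k$-th order FCGR can be computed: every $k$-mer of the sequence is mapped to one cell of a $2^k \times 2^k$ grid, and the cell stores how often that $k$-mer occurs. This is an illustration only and not the repository's fasta2fcgr implementation. The function names and the corner assignment of the nucleotides are assumptions (conventions differ between tools), and $k$-mers containing symbols other than A, C, G, T are skipped here.

```python
# Assumed corner assignment for the four nucleotides as (x_bit, y_bit);
# the repository's fasta2fcgr step may use a different convention.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_to_pixel(kmer):
    """Map a k-mer to its (x, y) cell in a 2^k x 2^k grid.

    As in the chaos game, later nucleotides select coarser quadrants,
    so the last base contributes the most significant bit.
    """
    x = y = 0
    for i, base in enumerate(kmer):
        bx, by = CORNERS[base]
        x |= bx << i
        y |= by << i
    return x, y


def fcgr(sequence, k):
    """Return the k-th order FCGR of `sequence` as a nested list of counts."""
    size = 2 ** k
    matrix = [[0] * size for _ in range(size)]
    seq = sequence.upper()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        # Skip k-mers with ambiguous symbols such as N.
        if all(base in CORNERS for base in kmer):
            x, y = kmer_to_pixel(kmer)
            matrix[y][x] += 1
    return matrix
```

For $k = 6$, as in the fcgr-6-mer folder, `fcgr(seq, 6)` yields a 64 x 64 matrix of 6-mer counts, which is the kind of image-like input a CNN can be trained on.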

A folder data/ will be created to save all intermediate results:

data/
├── fcgr-6-mer
├── hCoV-19
├── matches
├── plots
├── saliency_map
├── shap_values
├── svm
├── test
└── train