A pangenome allele-specific and copy number specific genotyping tool
Cite the paper»
Report Bug
·
Give Feedback
·
Ask For Help
·
Suggest include new genes
- Getting Started
- About ctyper
- Repository Overview
- Prerequisites
- Inputs
- Installation
- Usage
- Results Annotation
- Results Visualization
- Cohort Analysis
- License
- Contact
Let us get start to genotype an example sample NA12718
first obtain its cram file
wget "ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239480/NA12718.final.cram"
wget "ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR323/ERR3239480/NA12718.final.cram.crai"
Download relying databases
wget "https://zenodo.org/records/14399353/files/HprcCpcHgsvc_final42_matrix.v1.0.txt.gz"
gunzip HprcCpcHgsvc_final42_matrix.v1.0.txt.gz
wget "https://zenodo.org/records/14399353/files/HprcCpcHgsvc_final42_matrix.v1.0.txt.index"
wget "https://zenodo.org/records/14399353/files/PangenomeAlleles_annotationfix.v1.0.tsv.gz"
gunzip PangenomeAlleles_annotationfix.v1.0.tsv.gz
Download and compile ctyper, making sure you have gcc >= 8 and eigen, zlib, and HTSlib installed
$EIGEN_ROOT is the root path of EIGEN, usally /usr/
git clone https://github.com/Walfred-MA/Ctyper
export EIGEN_ROOT=$EIGEN_ROOT
cd Ctyper/src && make
mv ctyper ..
Now, let's us start to genotype it, $LIBRARY_PATH is your path of LD_LIBRARY, if you are using conda to install HTSlib or samtools, then it should be usally at /home/$user_name/miniconda3/lib/, otherwise have a try at /usr/local/lib/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH
Ctyper/ctyper -i NA12718.final.cram -m HprcCpcHgsvc_final42_matrix.v1.0.txt -c 1 -b Ctyper/data/backgrounds.list -o ctyper.out > log.txt &
After finishing genotyping, let us start to visualize and analyze, making sure you have python3 installed.
first, change the raw output to a text table. The headers are: allele name, allele type, HG38 gene(s), transcriptname(exon_number):transcript_genename:similarity, type category(ref: in same type of HG38 gene, alt: different orthology type, dup: duplicated paralog, novel: diverged paralog), pangenome assembly location, reference liftover location and alignment CIGAR string.
python Ctyper/tools/Annotation/SampleAnnotate.py -i ctyper.out -a PangenomeAlleles_annotationfix.v1.0.tsv -o genotype.txt
second, if you want to have genotypes of HLA, CYP2D6, and KIR.
python Ctyper/tools/Annotation/Nomenclature/GenotypetoNomenclature.py -i genotype.txt -a Ctyper/data/all_nomenclature.txt
example output: (the first column are names of pangenome allele, the second column is their public Nomenclature)
CYP2D_group1_HG01358_h2_162 *1
CYP2D_group1_GW00005_h1_12 *1
HLA-A_group2_HG01071_h2_1333 HLA-A*03:01:01:01
HLA-A_group2_HG00741_h1_1304 HLA-A*11:01:01:01
HLA-B_group10_HG02717_h2_2635 HLA-B*15:03:01:01
HLA-B_group10_HG02055_h2_2334 HLA-B*07:02:01:14
HLA-C_group8_chr6_2356 HLA-C*07:02:01:03
HLA-DM_group1_HG01361_h1_160 HLA-DMA*01:01:01:13,HLA-DMB*01:04
HLA-DM_group1_NA12878_h1_216 HLA-DMA*01:03:01:02,HLA-DMB*01:07:01:02
HLA-DO_group1_NA12878_h2_435 HLA-DOB*01:01:01:24
HLA-DO_group1_HG02572_h1_374 HLA-DOB*01:01:01:11
HLA-DO_group2_chr6_GL000251v2_alt_485 HLA-DOA*01:01:05
HLA-DO_group2_chr6_GL000253v2_alt_490 HLA-DOA*01:01:01:01
HLA-DP_group2_HG00733_h2_790 HLA-DPA1*02:01:01:01,HLA-DPB1*11:01:01:01
HLA-DP_group3_HG00733_h2_789 HLA-DPB2*03:01:01:03
HLA-DQ_group84_HG01071_h1_58107 HLA-DQA2*01:04:01:01,HLA-DQB2*01:01:01:04
HLA-DRA_group1_HG02818_h2_198 HLA-DRA*01:02:02:11
HLA-DRA_group1_chr6_GL000253v2_alt_244 HLA-DRA*01:01:01:02
HLA-E_group6_GW00048_h1_639 HLA-E*01:01:01:01
HLA-E_group6_chr6_GL000255v2_alt_2012 HLA-E*01:01:01:52
HLA-F_group1_HG01258_h2_666 HLA-F*01:03:01:01
HLA-F_group1_GW00057_h1_494 HLA-F*01:01:02:09
HLA-G_group1_HG01071_h2_771 HLA-G*01:01:01:05
HLA-G_group1_GW00038_h1_416 HLA-G*01:01:03:03
HLA-H_group3_HG02630_h2_1530 HLA-H*02:04:01
HLA-H_group3_NA12878_h2_2195 HLA-H*02:07:01:01
HLA-J_group1_GW00027_h1_325 HLA-J*01:01:01:04
HLA-J_group1_GW00043_h1_497 HLA-J*01:01:01:02
HLA-K_group2_HG01258_h2_650 HLA-K*01:01:01:01
HLA-L_group1_GW00022_h2_200 HLA-L*01:03
HLA-L_group1_GW00052_h2_462 HLA-L*01:03
HLA-N_group1_NA24385_h1_230 HLA-N*01:01:01:01
HLA-P_group1_HG01071_h2_425 HLA-P*02:01:01:02
HLA-S_group1_GW00018_h1_38 HLA-S*01:01:01:02
HLA-S_group1_chr6_GL000253v2_alt_241 HLA-S*01:01:01:01
HLA-T_group1_chr6_463 HLA-T*03:01:01:01
HLA-U_group1_HG00732_h1_347 HLA-U*01:05
HLA-U_group1_HG01175_h1_389 HLA-U*01:04
HLA-V_group1_HG01071_h2_641 HLA-V*01:01:01:01
HLA-W_group1_HG01071_h2_308 HLA-W*01:01:01:05
KIR2DL_group1_HG01952_h1_188 KIR2DL1*0040101,KIR2DL2*0010102,KIR2DL4*0080204,KIR2DL5B*0020109
KIR2DL_group1_chr19_KI270933v1_alt_348 KIR2DL1*0020103,KIR2DL3*0020103,KIR2DL4*0080204
KIR3DL1_group1_HG01123_h2_259 KIR3DL1*0040201
KIR3DL1_group1_HG00741_h2_246 KIR3DL1*0040101
Then let Plot SMN genotyping results as mutant map (make sure you have pandas and matplotlib installed)
(optional) download genecode gff3 annotation
wget "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/gencode.v47.annotation.gff3.gz"
zcat gencode.v47.annotation.gff3.gz | grep "gene_name=SMN" > SMN.gff3
Get names of genotyped SMN alleles
grep "^SMN" genotype.txt | cut -f1 | tr "$\n" "," | sed 's/.$/\n/'
SMN_group1_GW00024_h2_152,SMN_group1_chr5_718,SMN_group1_chr5_719,SMN_group1_HG01175_h1_475
Otain all SMN annotation from database
grep "^SMN" PangenomeAlleles_annotationfix.v1.0.tsv > SMN_annotation.txt &
Plot
python Ctyper/tools/Plot/typemutant.py -i SMN_allannotations.txt (optional) -g SMN.gff3 -n SMN_group1_GW00024_h2_152,SMN_group1_chr5_718,SMN_group1_chr5_719,SMN_group1_HG01175_h1_475 -o SMN.png
example output of SMN mutant map:
Each row is each allele, and each vertically location is the consensus in multiple sequence alignments.
The gap is the deletion/missing sequences on each allele and each black dots is each variants on HG38.
Each color represents each allele-type we defined.
The location of genotyped alleles are slightly bolded in color (may need to zoom in sometimes) and highlighted with a red dot on the left.
(optional) The genecode gene and exons are highlight at the top.
Ctype is a command line tool to perform copy number sensitive versatile genotyping for NGS (Next-Generation Sequencing) data using pangenome database. It is designed to work on complex CNV (copy number variation) genes, but can also work as genotyping, local phasing or SV-calling tools for other genes.
The results will be represented as pangenome alleles, which is defined as genic segments with locally phased variants that are combinatorially heritable mostly range in 15-50 kb, about the size of "Haplotype blocks" or "LD-blocks". The pangenome alleles are further classified as allele-types among populations, which are highly similar subgroups, and can be used to represent complex genetic variations such as structrual variations, gene conversion, duplication, translocation and etc.
Ctyper is highly efficient, accurate and visualizable, thus allows high resolution large cohort association studies on complex CNV genes as well as complex genetic variations.
This repository includes the following components:
-
Ctyper (src/)
A C++ program for performing genotyping on NGS data. -
Auxiliary Analyze Tool (tools/Annotation/)
Python tools to analyze genotyping results.tools/Annotation/SampleAnnotate.py
: Convert the raw output file to more interpretable table with annotations.tools/Annotation/Nomenclature/GenotypetoNomenclature.py
: Output the public nomenclatures from genotyping results. The nomenclatures annotation can be found at Data/all_nomenclature.txt.tools/Annotation/VCF/GenotypetoVCF.py
: Converting the genotyping results to VCF format. This requires Individual Sample Annotation Database (eg. PangenomeAlleles_annotationfix.v1.0.tsv). Note: this not recommended to be used in association studies, because of the limitation of VCF file on representing pangenome, but maybe used for locating known important variants.
-
Visualization Tool (tools/Plot/)
A Python tool for visualizing genotyping results using multiple sequence alignments. -
Cohort Analysis Tool (tools/Cohort/)
A Python tool for summarizing results across all samples in a cohort study, including annotation information. -
Pangenome Allele Database
A database required by ctyper for genotyping. Files:HprcCpcHgsvc_final42_matrix.v1.0.txt
andHprcCpcHgsvc_final42_matrix.v1.0.txt.index
available at https://zenodo.org/records/13381931. -
Population Analysis Annotation Database
A database containing annotation information to support population-level analysis. File:PangenomeAlleles_typefix.v1.0.tsv
available at https://zenodo.org/records/13381931. -
Individual Sample Annotation Database
A comprehensive database with detailed annotations for individual pangenome alleles, supporting individual sample studies and visualization. File:PangenomeAlleles_annotationfix.v1.0.tsv
available at https://zenodo.org/records/13381931. -
Additional Data Files (data/)
backgrounds.list
: A list of k-mers used as backgrounds to determine NGS coverage if not predetermined.select_files.txt
: The catalog for all included genes and matrices, which can be used to locate the gene of interest.all_nomenclature.txt
: The public nomenclatures used by GenotypetoNomenclature.py, currently including HLAs, CYP2D6, and KIRs.- HLAs include: HLA-A, HLA-B, HLA-C, HLA-DMA, HLA-DMB, HLA-DOA, HLA-DOB, HLA-DPA1, HLA-DPA2, HLA-DPB1, HLA-DPB2, HLA-DQA1, HLA-DQA2, HLA-DQB1, HLA-DQB2, HLA-DRA, HLA-E, HLA-F, HLA-G, HLA-H, HLA-J, HLA-K, HLA-L, HLA-N, HLA-P, HLA-S, HLA-T, HLA-U, HLA-V, HLA-W, HLA-Y.
- KIRs include: KIR2DL1, KIR2DL2, KIR2DL3, KIR2DL4, KIR2DL5A, KIR2DL5B, KIR2DP1, KIR2DS1, KIR2DS2, KIR2DS3, KIR2DS4, KIR2DS5, KIR3DL1, KIR3DL2, KIR3DL3.
-
Test Cases (tests/)
Simple test cases for validating the tools and pipeline.
Ctyper is officially supported only in a Linux environment. Although it may run on UNIX-like system like MACOS as well, but we may not provide support for it.
You need RAM > 20G to run all genes at once. If you have a less RAM, you may split the whole database to smaller partitions. See in tools/Partition/.
- GCC 8
- Eigen Library
https://gitlab.com/libeigen/eigen - HTSlib
https://github.com/samtools/htslib
Alternatively, you can install the entire Samtools package:
https://github.com/samtools/samtools
For additional Python tools, you need:
- Python 3.7+
- NumPy
- Pandas
https://pandas.pydata.org/ - Matplotlib
https://matplotlib.org/stable/install/index.html
Ctyper takes five types of files as input:
- CRAM files (
*.cram
) — must be indexed (recommended due to lower I/O intensity). - BAM files (
*.bam
) — must be indexed. - SAM files (
*.sam
). - FASTQ files (
*.fastq
). - FASTA files (
*.fa
,*.fasta
).
-
Install all prerequisites.
-
Navigate to the
src/
directory:cd src/
-
Compile ctyper:
make
-
Move ctyper to your installation folder:
mv ctyper /path/to/your/install/folder
-
Download the pangenome allele database file (we'll refer to it as
$Database
) and its index file. -
(Optional) Download the allele-type annotation file.
-
(Optional) Download the full alleles annotation file.
-
You're ready to go!
Ctyper requires:
-
Input file(s)
-
Corresponding output file(s)
-
The pangenome allele database file (must be indexed)
- the database of pangenome allele for 3,351 CNV genes and 212 medically challenging genes. The information of those genes and their belonged matrices can be found at data/select_files.txt.
- If you prefer only run selected genes or you want to run in smaller RAM, you can use a tool called matrixpartion.py we included at tools/Partition
-
Sequencing coverage information, either by:
- Providing background k-mers for ctyper to determine sequencing coverage (recommended)
- Directly providing sequencing coverage information (useful when you don't have WGS data)
Ctyper can process:
Using background k-mers:
ctyper -i $Inputfile -m $Database -o $Outputfile -b background.list -n 1
Or providing sequencing coverage of 31-mers: 31-mers = read_coverage * (read_length - 30) / read_length, for example a NGS at read coverage = 30, its 31-mers coverage is 30 * (150-30)/150 = 24.
ctyper -i $Inputfile -m $Database -o $Outputfile -d $sequencing_coverage -n 1
If you only interested in genes at certain region, for example chr1:100-1000
samtools view -b input.bam chr1:100-1000 -f 4 -o subset.bam
samtools sort -o sorted_input.bam input.bam
samtools index sorted_input.bam
ctyper -i sorted_input.bam -m $Database -o $Outputfile -d $sequencing_coverage -n 1
Using background k-mers:
ctyper -I $AllInputs -m $Database -o $AllOutputs -b background.list -n $threads
Or providing sequencing coverage:
ctyper -I $AllInputs -m $Database -o $AllOutputs -D $All_sequencing_coverages -n $threads
- $AllInputs: A text file where each line is the path to an input file.
- $AllOutputs: A text file where each line is the output file corresponding to the input file (same line number).
- $All_sequencing_coverages: A text file where each line is the sequencing coverage information for the corresponding input file (same line number).
Supported parameters:
-
Help:
-h
: Print help information.
-
Inputs:
-i <string>
: Path to an individual input file.-I <string>
: Path to a file listing multiple input files (one per line).
-
Database:
-m <string>
: Path to the database file used for genotyping.-b <string>
: Path to the background k-mer list.
-
Coverage Information (cannot be used with
-b
):-d <float>
: Sequencing coverage of the input file.-D <string>
: Path to a file listing sequencing coverages (one per input file, corresponding by line number).
-
Outputs:
-o <string>
: Path to the individual output file.-O <string>
: Path to a file listing multiple output files (one per input file, corresponding by line number).
-
Multithreading:
-n <int>
: Number of threads to use (default is 1).
-
Bias Correction:
-c <bool>
: Perform bias correction for Illumina data.
Here shows the commands to make the genotyping results interpretable:
-
Summary the genotyping results into a table with annotation information
python tools/Annotation/SampleAnnotate.py -i $ctyper_outputs.txt -a PangenomeAlleles_annotationfix.tsv > genotype.txt
-
(optional) Obtain public nomenclatures for important genes,including HLA, CYP2D6, and KIR
python tools/Annotation/Nomenclature/GenotypetoNomenclature.py -i genotype.txt -a data/all_nomenclature.txt > nomenclature.txt
-
(optional) Convert genotyping results into VCF file
python tools/Annotation/VCF/GenotypetoVCF.py -i genotype.txt -o genotype.vcf
Visualization is performed on a gene-by-gene basis (not genome-wide).
For example, to visualize the gene AMY1A:
-
Identify the gene group for AMY1A:
cat data/select_files.txt | grep -w "AMY1A" | cut -f2,3
Output:
newGeneCopies/AMY/AMY_partitions/AMY_group1_AMY1BOOOAMY1COOOAMY1A.fa AMY1B,AMY1C,RNPC3,AMY1A,AMYP1,ACTG1P4,RP5-1108M17.5,AMY2B,AMY2A,
This shows that AMY1A is in AMY_group1, along with other amylase genes.
-
Extract the annotation for AMY_group1 from the full annotation table:
grep "^AMY_group1_" PangenomeAlleles_annotationfix.tsv > AMY_group1_annotationfix.tsv
-
Extract the genotyping results for AMY_group1:
grep "^result: AMY_group1_" genotype.txt
Output:
result: AMY_group1_GW00031_h1_556,AMY_group1_GW00051_h2_891,
The genotyping result is:
AMY_group1_GW00031_h1_556,AMY_group1_GW00051_h2_891
. -
Visualize the results:
python typemutant.py -i AMY_group1_annotationfix.tsv -n "AMY_group1_GW00031_h1_556,AMY_group1_GW00051_h2_891" -o output.png
Optional: To visualize the GENCODE annotation on the MSA:
-
Obtain the GENCODE annotation:
grep "gene_name=AMY" genecode.gff3 > AMY.gff3
-
Run the visualization with GENCODE annotation:
python typemutant.py -i AMY_group1_annotationfix.tsv -g AMY.gff3 -n "AMY_group1_GW00031_h1_556,AMY_group1_GW00051_h2_891," -o output.png
There are two scripts in the tools/Cohort
folder that work together for cohort analysis.
-
Download the allele-type annotation table:
PangenomeAlleles_typefix.tsv
-
Run
CountAllele.py
on each sample to get allele-type results:# For each result in $results for result in $results; do python CountAllele.py -i $result -t PangenomeAlleles_typefix.tsv -o "${result}_alleletype.out" done
Or, to run in parallel:
python CountAllele.py -f $results_folder/ -t PangenomeAlleles_typefix.tsv -n $numthreads
-
Summarize results into a single file and add annotations:
python SummaryAlleles.py -f $results_folder/ -t PangenomeAlleles_typefix.tsv -o cohort_results.tsv
Distributed under the MIT License. See LICENSE.txt
for more information.
Walfred MA - [email protected] Mark Chaisson - [email protected]