computational method to perform immunoglobulin allele-level genotyping and identification of sample specific variation via flow variation graphs.

dduchen/BIgFOOT

BIgFOOT: Biomarkers of Immunovariation via Graph FOOTprinting

Current version: 0.1.0

This workflow infers the closest known reference allele, embeds and calls sample-specific variation within each gene, and infers novel allelic sequences via iterative graph construction and sequence-to-graph alignment. Focusing on poorly characterized immunoglobulin (Ig) and other adaptive immune receptor repertoire (AIRR)-related genes, BIgFOOT aims to identify AIRR loci/subgraphs/FOOTprints associated with the host immune response using widely available NGS data.
I plan to expand this workflow to enable genome-to-genome analyses and genetic association testing to interrogate the role of germline AIRR variation in immune-mediated diseases (including infectious disease).

Genetic loci where BIgFOOT performs accurate allele calling:

  • IGH
  • IGL
  • IGK
  • HLA (DQA1/DQB1/... more to come)

Infers alleles - but, like Bigfoot, I have no evidence they're real (WiP):

  • TR
  • KIR

Input:

  • Raw fastq(.gz)
  • BAM/CRAM alignment

Note: you'll need ~65GB of RAM to successfully perform sequence-to-graph alignment against the full genome immunovariation graph.
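As a rough pre-flight check for the memory requirement noted above, the sketch below compares total system RAM against 65 GB. It assumes a Linux host that exposes /proc/meminfo, and the helper name has_enough_ram is hypothetical (it is not part of BIgFOOT). It looks at total rather than currently free memory, so treat a pass as necessary, not sufficient.

```shell
# Approximate RAM (in GB) taken from the note above.
required_gb=65

# has_enough_ram TOTAL_KB REQUIRED_GB
# Hypothetical helper: succeeds if TOTAL_KB (kB) covers REQUIRED_GB (GB).
has_enough_ram() {
    [ "$1" -ge $(( $2 * 1024 * 1024 )) ]
}

# Linux only: MemTotal in /proc/meminfo is reported in kB.
if [ -r /proc/meminfo ]; then
    total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    if has_enough_ram "${total_kb}" "${required_gb}"; then
        echo "RAM check passed: ${total_kb} kB total"
    else
        echo "Warning: ${total_kb} kB total RAM is below ~${required_gb} GB" >&2
    fi
fi
```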

Set up conda environment

BIgFOOT is heavily influenced by, and relies on, methods developed for VG-Flow (v0.0.4).

  1. Clone me!
    git clone https://github.com/dduchen/BIgFOOT.git
  2. Set up the conda/mamba environment we'll be needing. Packages that are already in your PATH (e.g., samtools) can be moved after the '#'; we assume you already have R.
    mamba create --name bigfoot -c bioconda -c conda-forge -c gurobi python=3 fastp graph-tool bazam minimap2 gurobi biopython numpy odgi gfaffix seqkit bbmap seqwish blend-bio wfmash samtools pyseer unitig-caller parallel #fastq-dl kmc r-base cd-hit
    conda activate bigfoot
    Ensure you have an active gurobi licence:
    gurobi_cl
    We also use the following R/Bioconductor packages:
    - data.table; dplyr; stringr
    - Biostrings/DECIPHER
    If sample-specific variant calling is desired:
    - bcftools; tabix
  3. We also use some external tools which need to be accessible in your PATH
    tools_dir=~/tools; # (wherever you normally install+store software)
    PATH=$PATH:${tools_dir};
    cd ${tools_dir};

Download BIgFOOT graph materials from Zenodo

bigfoot_source=${tools_dir}/bigfoot # where are we storing all of the reference graph files?
mkdir -p ${bigfoot_source}
wget -P ${bigfoot_source} "https://zenodo.org/records/10869771/files/immunovar_graph_materials.tar.gz?download=1"
cd ${bigfoot_source} ; tar -xvf ${bigfoot_source}/immunovar_graph_materials.tar.gz* --keep-newer-files
Make distance indexes read only
chmod 0444 *.dist
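After extraction, it may be worth confirming that the graph indexes are in place before starting a long alignment. The sketch below assumes the archive unpacks the whole_genome_ig_hla_kir_immunovar graph and its .xg, .gbwt, .dist and .min indexes (the names used by the vg giraffe commands later in this README) directly into ${bigfoot_source}; check_graph_files is a hypothetical helper, not a BIgFOOT script.

```shell
# check_graph_files GRAPH_BASE
# Hypothetical helper: prints one line per expected index file that is
# absent or empty; prints nothing when all four are present.
check_graph_files() {
    for ext in xg gbwt dist min; do
        [ -s "$1.${ext}" ] || echo "missing: $1.${ext}"
    done
}

# Assumed location of the extracted graph; adjust if your layout differs.
check_graph_files "${bigfoot_source:-.}/whole_genome_ig_hla_kir_immunovar"
```

An empty result means all four files were found; any "missing:" line suggests an incomplete download or a different extraction directory.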
We also need the variation graph toolkit (VG) executable

We use Ryan Wick's Assembly-dereplicator package during haplotype selection.
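Since the pipeline depends on several executables being reachable through PATH, a quick check along the following lines can catch a missing tool early. The list of executable names is an assumption based on the packages mentioned in this README, and report_missing is a hypothetical helper; adjust both to match your installation.

```shell
# report_missing TOOL...
# Hypothetical helper: prints the name of every tool not found on PATH.
report_missing() {
    for tool in "$@"; do
        command -v "${tool}" >/dev/null 2>&1 || echo "${tool}"
    done
}

# Assumed executable names for the dependencies described above.
missing=$(report_missing vg samtools minimap2 fastp seqkit odgi gurobi_cl Rscript bcftools tabix)
if [ -n "${missing}" ]; then
    echo "Not found on PATH:" ${missing} >&2
fi
```

Run it after `conda activate bigfoot` and after extending PATH with ${tools_dir}, so that both conda-installed and manually installed tools are visible.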

Running BIgFOOT - Example using sequencing/alignment files from IGSR: NA19240

Yoruba in Ibadan, Nigeria, African Ancestry

Set up example directory, download relevant files, and then run BIgFOOT pipeline

  • conda activate bigfoot
    bigfoot_dir=${bigfoot_source}/scripts
    (Change this if you've downloaded the GitHub repo somewhere else or have the BIgFOOT analysis scripts saved elsewhere)
    bigfoot_dir=${tools_dir}/BIgFOOT/scripts ; immunovar_bed=${bigfoot_source}/grch38_custom_immunovar_coords.bed
    test_dir=${bigfoot_source}/example/ ; mkdir -p ${test_dir}; cd ${test_dir}

Starting from raw reads (WES)

Illumina chemistry: V2, Array: Agilent Sure Select Whole exome capture 50 Mb

  • #fastq-dl -a SRR507323 -o ${test_dir}/
    wget -P ${test_dir}/ ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR507/SRR507323/SRR507323_1.fastq.gz
    wget -P ${test_dir}/ ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR507/SRR507323/SRR507323_2.fastq.gz
  • export sample="SRR507323" workdir=${PWD} bigfoot_source=${bigfoot_source} bigfoot_dir=${bigfoot_dir} merged="FALSE" graph="wg_immunovar" valid_alleles=true
    ################################################################
    . ${bigfoot_dir}/preprocess_wg_immunovar_alignment.sh
    ################################################################
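Before launching the preprocessing script above, it can help to confirm that both FASTQ downloads completed and are properly paired. This is a minimal sketch, not part of BIgFOOT: count_reads and check_pair are hypothetical helpers, and the read count assumes standard four-line FASTQ records. Counting reads decompresses each file in full, so expect it to take a while on exome-sized data.

```shell
sample=${sample:-SRR507323}

# count_reads FASTQ_GZ
# Hypothetical helper: number of records, assuming 4 lines per record.
count_reads() {
    echo $(( $(gzip -dc "$1" | wc -l) / 4 ))
}

# check_pair R1 R2
# Hypothetical helper: succeeds only if both files pass the gzip
# integrity test and contain the same number of records.
check_pair() {
    gzip -t "$1" && gzip -t "$2" || return 1
    [ "$(count_reads "$1")" -eq "$(count_reads "$2")" ]
}

if [ -f "${sample}_1.fastq.gz" ] && [ -f "${sample}_2.fastq.gz" ]; then
    if check_pair "${sample}_1.fastq.gz" "${sample}_2.fastq.gz"; then
        echo "${sample}: paired FASTQ files look consistent"
    else
        echo "${sample}: FASTQ files are truncated or unpaired" >&2
    fi
fi
```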

Starting from BAM/CRAM (WGS)

  • wget -P ${test_dir}/ ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR398/ERR3989410/NA19240.final.cram
    tools_dir=${tools_dir} PATH=${tools_dir}:$PATH
  • export bam_file="NA19240.final.cram" workdir=${PWD} bigfoot_source=${bigfoot_source} bigfoot_dir=${bigfoot_dir} ref_build="grch38" ref="${bigfoot_source}/GRCh38_full_analysis_set_plus_decoy_hla.fa" tools_dir=${tools_dir} PATH=${tools_dir}:$PATH merged="FALSE" graph="wg_immunovar" valid_alleles=true
    ################################################################
    . ${bigfoot_dir}/process_from_bam_wg_immunovar_alignment.sh > ${bam_file%.cram}.log
    ################################################################

    Support for CHM13-based BAM/CRAM is planned.

Starting from subset of reads, some manual pre-processing

  • graphdir=${bigfoot_source};graph="wg_immunovar";graph_base=${graphdir}/whole_genome_ig_hla_kir_immunovar;immune_graph=${graph_base}".subgraph";
    bazam_reads=${i}; sample_id=${bazam_reads%.bazam.fastq.gz};sample_id=${sample_id##*/};

    Sequence-to-graph alignment using VG-giraffe
  • vg giraffe -i -f ${bazam_reads} -x ${graph_base}.xg -H ${graph_base}.gbwt -d ${graph_base}.dist -m ${graph_base}.min -p > ${sample_id}.bazam.grch38.wg.gam
  • vg giraffe -f ${sample_id}.unmapped.fastq.gz -x ${graph_base}.xg -H ${graph_base}.gbwt -d ${graph_base}.dist -m ${graph_base}.min -p > ${sample_id}.unmapped.grch38.wg.gam
    cat ${sample_id}.bazam.grch38.wg.gam ${sample_id}.unmapped.grch38.wg.gam > ${sample_id}.bazam.grch38.combined.gam

    Ready for BIgFOOT
  • export i=${sample_id}.bazam.grch38.combined.gam workdir=${PWD} graph=${graph} bigfoot_source=${bigfoot_source} bigfoot_dir=${bigfoot_dir} valid_alleles=true
    ################################################################
    . ${bigfoot_dir}/filter_immune_subgraph.sh
    ################################################################
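The commands above derive the sample ID and the log file name with shell parameter expansion. The short sketch below shows what those expansions produce; the path /data/run1/NA19240.bazam.fastq.gz is a hypothetical example, not a file shipped with BIgFOOT.

```shell
# Hypothetical input path, used only to illustrate the expansions.
bazam_reads="/data/run1/NA19240.bazam.fastq.gz"

sample_id=${bazam_reads%.bazam.fastq.gz}   # strip the suffix     -> /data/run1/NA19240
sample_id=${sample_id##*/}                 # strip the directories -> NA19240
echo "${sample_id}"

# Same idea for the log name used in the BAM/CRAM workflow.
bam_file="NA19240.final.cram"
echo "${bam_file%.cram}.log"               # -> NA19240.final.log
```

Because `%` removes only a matching suffix, an input that does not end in .bazam.fastq.gz keeps its full file name as the sample ID, so it is worth naming read files consistently.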

This is still very much a work in progress - many parameters/options exist but have not been fully documented here, and this repo is under active development.

Please reach out if you feel this tool might be useful in your work, you have questions regarding its use, or if you'd like some added functionality - you can open an issue or email
