BIgFOOT: Biomarkers of Immunovariation via Graph FOOTprinting

Current version: 0.1.0

This workflow infers the closest known reference allele, embeds and calls sample-specific variation within each gene, and infers novel allelic sequences via iterative graph construction and sequence-to-graph alignment. Focusing on poorly characterized immunoglobulin (Ig) and other adaptive immune receptor repertoire (AIRR)-related genes, BIgFOOT aims to identify AIRR loci/subgraphs/FOOTprints associated with the host immune response using widely available NGS data.
I plan to expand this workflow to enable genome-to-genome analyses/genetic association testing to interrogate the role of germline AIRR variation in immune-mediated diseases (including infectious disease).

Genetic loci where BIgFOOT performs accurate allele calling:

IGH
IGL
IGK
HLA (DQA1/DQB1/... more to come)

Infers alleles - but, like bigfoot, I have no evidence they're real (WiP):

TR
KIR

Input:

Raw fastq(.gz)
BAM/CRAM alignment
Note: you'll need ~65GB of RAM to successfully perform sequence-to-graph alignment against the full genome immunovariation graph
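The memory requirement can be checked before launching an alignment. This is a minimal sketch, assuming a Linux host that exposes /proc/meminfo; the helper names and the optional second argument are illustrative and not part of BIgFOOT.

```shell
# Hypothetical helpers (not part of BIgFOOT).
# Assumes Linux, where /proc/meminfo reports MemTotal in kB.

kb_to_gb() {
    # Integer division, so the result is rounded down: kB -> GB (1 GB = 1024 * 1024 kB)
    echo $(( $1 / 1024 / 1024 ))
}

has_enough_ram() {
    required_gb=$1
    meminfo=${2:-/proc/meminfo}      # a second argument lets you point at another file
    total_kb=$(awk '/^MemTotal:/ {print $2}' "$meminfo")
    [ -n "$total_kb" ] || return 2   # memory could not be determined
    [ "$(kb_to_gb "$total_kb")" -ge "$required_gb" ]
}

# Example:
# has_enough_ram 65 || echo "Warning: less than 65GB of RAM detected" >&2
```

Because the conversion rounds down and the kernel reserves some memory, a machine sold as 64GB will report less than 65 here; treat the threshold as approximate.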

Set up conda environment

BIgFOOT is heavily influenced by, and relies on, methods developed for VG-Flow (v0.0.4).

Clone me!
git clone https://github.com/dduchen/BIgFOOT.git
Set up the conda/mamba environment we'll be needing -- you can move some of these after the '#' if they're already in your path (e.g., samtools; we assume you have R)
mamba create --name bigfoot -c bioconda -c conda-forge -c gurobi python=3 fastp graph-tool bazam minimap2 gurobi biopython numpy odgi gfaffix seqkit bbmap seqwish blend-bio wfmash samtools pyseer unitig-caller parallel #fastq-dl kmc r-base cd-hit
conda activate bigfoot
Ensure you have an active gurobi licence:
gurobi_cl
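Once the environment is active, it can help to confirm that the command-line tools are actually visible. A minimal sketch follows; the helper name is hypothetical, and the tool list in the example is only a subset taken from the install command, on the assumption that each package installs an executable of the same name.

```shell
# Hypothetical helper (not part of BIgFOOT): print each named tool that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example, after `conda activate bigfoot`:
# missing_tools fastp minimap2 samtools seqkit odgi gurobi_cl
```

An empty result means every listed tool was found.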
We also use the following R/bioconductor packages:
- data.table; dplyr; stringr
- Biostrings/DECIPHER

If sample-specific variant calling is desired:
- bcftools; tabix

We also use some external tools which need to be accessible in your PATH
tools_dir=~/tools; # (wherever you normally install+store software)
PATH=$PATH:${tools_dir};
cd ${tools_dir};
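If these lines are re-run in the same session, the tools directory gets appended to PATH each time. A small sketch of an idempotent alternative, with a hypothetical helper name that is not part of BIgFOOT:

```shell
# Hypothetical helper (not part of BIgFOOT): append a directory to PATH only once.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;               # already present, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
}

# Example:
# tools_dir=~/tools
# mkdir -p "$tools_dir" && path_append "$tools_dir" && cd "$tools_dir"
```

The `mkdir -p` in the example also covers the case where the tools directory does not exist yet, which would otherwise make the `cd` fail.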

Download BIgFOOT graph materials from zenodo

bigfoot_source=${tools_dir}/bigfoot # where are we storing all of the reference graph files?
mkdir -p ${bigfoot_source}
wget -P ${bigfoot_source} "https://zenodo.org/records/10869771/files/immunovar_graph_materials.tar.gz?download=1"
cd ${bigfoot_source} ; tar -xvf ${bigfoot_source}/immunovar_graph_materials.tar.gz* --keep-newer-files
Make distance indexes read only
chmod 0444 *.dist
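After extraction, a quick check that the graph index files are in place can save a failed run later. This sketch assumes the archive provides the four index types that the `vg giraffe` commands in this README read (.xg, .gbwt, .dist, .min) under the graph base name used there; the helper name is hypothetical.

```shell
# Hypothetical helper (not part of BIgFOOT): report which graph index files are missing.
# The extensions are the ones passed to `vg giraffe` elsewhere in this README.
check_graph_files() {
    graph_base=$1
    status=0
    for ext in xg gbwt dist min; do
        if [ ! -s "${graph_base}.${ext}" ]; then
            echo "missing or empty: ${graph_base}.${ext}" >&2
            status=1
        fi
    done
    return $status
}

# Example:
# check_graph_files "${bigfoot_source}/whole_genome_ig_hla_kir_immunovar"
```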
We also need the variation graph toolkit (VG) executable

wget -P ${tools_dir}/ https://github.com/vgteam/vg/releases/download/v1.56.0/vg
chmod +x ${tools_dir}/vg
PATH=${tools_dir}:$PATH

We use Ryan Wick's Assembly-dereplicator package during haplotype selection.

git clone https://github.com/dduchen/Assembly-Dereplicator.git ${tools_dir}/Assembly-dereplicator
We provide the option of using merged paired-end reads from NGmerge for alignment/inference (optional, not always recommended).
git clone https://github.com/dduchen/NGmerge.git ${tools_dir}/NGmerge

Running bigfoot - Example using sequencing/alignment files from IGSR: NA19240

Yoruba in Ibadan, Nigeria, African Ancestry

Set up example directory, download relevant files, and then run BIgFOOT pipeline

conda activate bigfoot
bigfoot_dir=${bigfoot_source}/scripts
# Change this if you've downloaded the github repo somewhere else/have the bigfoot analysis scripts saved elsewhere
bigfoot_dir=${tools_dir}/BIgFOOT/scripts
immunovar_bed=${bigfoot_source}/grch38_custom_immunovar_coords.bed
test_dir=${bigfoot_source}/example/ ; mkdir -p ${test_dir}; cd ${test_dir}

Starting from raw reads (WES)

Illumina chemistry: V2, Array: Agilent Sure Select Whole exome capture 50 Mb

#fastq-dl -a SRR507323 -o ${test_dir}/
wget -P ${test_dir}/ ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR507/SRR507323/SRR507323_1.fastq.gz
wget -P ${test_dir}/ ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR507/SRR507323/SRR507323_2.fastq.gz

export sample="SRR507323" workdir=${PWD} bigfoot_source=${bigfoot_source} bigfoot_dir=${bigfoot_dir} merged="FALSE" graph="wg_immunovar" valid_alleles=true
################################################################
. ${bigfoot_dir}/preprocess_wg_immunovar_alignment.sh
################################################################
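The pipeline script is sourced and reads its settings from the exported variables, so a variable that is unset or empty may only surface as an error partway through. A minimal sketch of a pre-flight check; the helper name is hypothetical, and which variables each script strictly requires is an assumption based on the export line for this example.

```shell
# Hypothetical helper (not part of BIgFOOT): fail if any named variable is unset or empty.
require_vars() {
    status=0
    for name in "$@"; do
        eval "value=\${$name:-}"     # indirect lookup of the variable called $name
        if [ -z "$value" ]; then
            echo "required variable not set: $name" >&2
            status=1
        fi
    done
    return $status
}

# Example, before sourcing preprocess_wg_immunovar_alignment.sh:
# require_vars sample workdir bigfoot_source bigfoot_dir merged graph valid_alleles \
#     || echo "fix the exports before continuing" >&2
```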

Starting from BAM/CRAM (WGS)

wget -P ${test_dir}/ ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR398/ERR3989410/NA19240.final.cram
tools_dir=${tools_dir}
PATH=${tools_dir}:$PATH

export bam_file="NA19240.final.cram" workdir=${PWD} bigfoot_source=${bigfoot_source} bigfoot_dir=${bigfoot_dir} ref_build="grch38" ref="${bigfoot_source}/GRCh38_full_analysis_set_plus_decoy_hla.fa" tools_dir=${tools_dir} PATH=${tools_dir}:$PATH merged="FALSE" graph="wg_immunovar" valid_alleles=true
################################################################
. ${bigfoot_dir}/process_from_bam_wg_immunovar_alignment.sh > ${bam_file%.cram}.log
################################################################
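The log name above is built with `${bam_file%.cram}`, which strips a trailing `.cram` only; a `.bam` input would be logged as `name.bam.log`. If a uniform name is preferred for both formats, a sketch along these lines could be used (the helper name is hypothetical and not part of BIgFOOT):

```shell
# Hypothetical helper (not part of BIgFOOT): derive a log file name from a BAM or CRAM name.
log_name_for() {
    case $1 in
        *.cram) echo "${1%.cram}.log" ;;
        *.bam)  echo "${1%.bam}.log" ;;
        *)      echo "$1.log" ;;     # unknown extension: just append .log
    esac
}

# Example:
# . ${bigfoot_dir}/process_from_bam_wg_immunovar_alignment.sh > "$(log_name_for "$bam_file")"
```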
Support for CHM13-based BAM/CRAM is planned

Starting from subset of reads, some manual pre-processing

graphdir=${bigfoot_source}
graph="wg_immunovar"
graph_base=${graphdir}/whole_genome_ig_hla_kir_immunovar
immune_graph=${graph_base}".subgraph"
bazam_reads=${i}
sample_id=${bazam_reads%.bazam.fastq.gz}
sample_id=${sample_id##*/}
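The sample ID is derived from the reads file name with two parameter expansions: one strips the `.bazam.fastq.gz` suffix, the other strips the leading directory. The sketch below wraps the same two expansions in a function and applies them to a made-up path; the function name and the path are illustrative only.

```shell
# Same two expansions as above, wrapped in a hypothetical helper.
sample_id_from_reads() {
    id=${1%.bazam.fastq.gz}   # drop the suffix
    id=${id##*/}              # drop everything up to the last slash
    echo "$id"
}

sample_id_from_reads /data/run1/NA19240.bazam.fastq.gz   # prints NA19240
```

Note that a file that does not end in `.bazam.fastq.gz` keeps its full base name, so the resulting output files would carry the original extension in their names.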
Sequence-to-graph alignment using VG-giraffe
vg giraffe -i -f ${bazam_reads} -x ${graph_base}.xg -H ${graph_base}.gbwt -d ${graph_base}.dist -m ${graph_base}.min -p > ${sample_id}.bazam.grch38.wg.gam
vg giraffe -f ${sample_id}.unmapped.fastq.gz -x ${graph_base}.xg -H ${graph_base}.gbwt -d ${graph_base}.dist -m ${graph_base}.min -p > ${sample_id}.unmapped.grch38.wg.gam
cat ${sample_id}.bazam.grch38.wg.gam ${sample_id}.unmapped.grch38.wg.gam > ${sample_id}.bazam.grch38.combined.gam
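Plain `cat` will happily write a combined file even if one of the alignments failed and left an empty or missing GAM. A minimal sketch of a guarded concatenation; the helper name is hypothetical and not part of BIgFOOT.

```shell
# Hypothetical helper (not part of BIgFOOT): concatenate alignment files
# only if every input exists and is non-empty.
combine_gams() {
    out=$1
    shift
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty input: $f" >&2
            return 1                 # stop before the output file is created
        fi
    done
    cat "$@" > "$out"
}

# Example:
# combine_gams ${sample_id}.bazam.grch38.combined.gam \
#     ${sample_id}.bazam.grch38.wg.gam ${sample_id}.unmapped.grch38.wg.gam
```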
Ready for BIgFOOT
export i=${sample_id}.bazam.grch38.combined.gam workdir=${PWD} graph=${graph} bigfoot_source=${bigfoot_source} bigfoot_dir=${bigfoot_dir} valid_alleles=true
################################################################
. ${bigfoot_dir}/filter_immune_subgraph.sh
################################################################

This is still very much a work in progress - many parameters/options exist but have not been fully documented here, and this repo is under active development.

Please reach out if you feel this tool might be useful in your work, you have questions regarding its use, or if you'd like some added functionality - you can open an issue or email
