This workflow infers the closest known reference allele, embeds/calls sample-specific variation within each gene, and infers novel allelic sequences via iterative graph construction and sequence-to-graph alignment. With a focus on poorly characterized immunoglobulin (Ig) and other adaptive immune receptor repertoire (AIRR)-related genes, BIgFOOT aims to identify AIRR loci/subgraphs/FOOTprints associated with the host immune response using widely available NGS data.
I plan to expand this workflow to enable genome-to-genome analyses/genetic association testing to interrogate the role of germline AIRR variation in immune-mediated diseases (including infectious disease).
Genetic loci where BIgFOOT performs accurate allele calling:
- HLA (DQA1/DQB1/... more to come)
Infers alleles - but, like bigfoot, I have no evidence they're real (WiP):
- TR
- Raw fastq(.gz)
- BAM/CRAM alignment
Note: you'll need ~65GB of RAM to successfully perform sequence-to-graph alignment against the full genome immunovariation graph.
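A quick pre-flight check can catch an undersized machine before a long alignment run. The sketch below is not part of BIgFOOT: `has_enough_ram` is a hypothetical helper, and it assumes a Linux host where total memory is reported in `/proc/meminfo`.

```shell
# Hypothetical helper (not part of BIgFOOT): succeed only if total RAM is at
# least the requested number of GB. Reads MemTotal (in kB) from a meminfo-style
# file; defaults to /proc/meminfo, which exists on Linux only.
has_enough_ram() {
    local required_gb=$1
    local meminfo=${2:-/proc/meminfo}
    local total_kb
    total_kb=$(awk '/^MemTotal:/ {print $2}' "${meminfo}" 2>/dev/null)
    # Return 2 if memory could not be determined at all.
    [ -n "${total_kb}" ] || return 2
    [ $(( total_kb / 1024 / 1024 )) -ge "${required_gb}" ]
}

if ! has_enough_ram 65; then
    echo "Warning: could not confirm ~65GB of RAM; full-graph alignment may fail" >&2
fi
```

The check only warns, since the ~65GB figure is approximate and reported memory is usually slightly below the installed amount.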
BIgFOOT is heavily influenced by, and relies on, methods developed for VG-Flow (v0.0.4).
- Clone me!
git clone
- Set up the conda/mamba environment we'll be needing. You can move packages after the '#' if they're already in your PATH (e.g., samtools; we assume you have R)
mamba create --name bigfoot -c bioconda -c conda-forge -c gurobi python=3 fastp graph-tool bazam minimap2 gurobi biopython numpy odgi gfaffix seqkit bbmap seqwish blend-bio wfmash samtools pyseer unitig-caller parallel #fastq-dl kmc r-base cd-hit
conda activate bigfoot
Ensure you have an active gurobi licence:
We also use the following R/bioconductor packages:
- data.table;dplyr;stringr
If sample-specific variant calling is desired:
- bcftools;tabix
- We also use some external tools which need to be accessible in your PATH
# (wherever you normally install+store software)
cd ${tools_dir};
bigfoot_source=${tools_dir}/bigfoot # where are we storing all of the reference graph files?
mkdir -p ${bigfoot_source}
wget -P ${bigfoot_source} ""
cd ${bigfoot_source} ; tar -xvf ${bigfoot_source}/immunovar_graph_materials.tar.gz* --keep-newer-files
Make distance indexes read only
chmod 0444 *.dist
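The `vg giraffe` commands later in this README read four index files per graph (`.xg`, `.gbwt`, `.dist`, `.min`). A minimal sketch for confirming they all exist after extraction is shown here; `check_graph_indexes` is a hypothetical helper, not a BIgFOOT script, and the graph prefix you pass it depends on what the tarball actually contains.

```shell
# Hypothetical helper (not part of BIgFOOT): confirm that the four index files
# used by vg giraffe exist and are non-empty for a given graph prefix.
check_graph_indexes() {
    local graph_base=$1 ext status=0
    for ext in xg gbwt dist min; do
        if [ ! -s "${graph_base}.${ext}" ]; then
            echo "missing or empty: ${graph_base}.${ext}" >&2
            status=1
        fi
    done
    return ${status}
}
```

For example, calling `check_graph_indexes "${graph_base}"` before alignment reports every missing index at once rather than failing on the first one.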
We also need the variation graph toolkit (VG) executable
wget -P ${tools_dir}/; chmod +x ${tools_dir}/vg
We use Ryan Wick's Assembly-dereplicator package during haplotype selection.
git clone ${tools_dir}/Assembly-dereplicator
We provide the option of using merged paired-end reads from NGmerge for alignment/inference (optional, not always recommended).
git clone ${tools_dir}/NGmerge
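Since several of these tools must be reachable on your PATH, it can help to check for them in one pass. The sketch below uses only the shell builtin `command -v`; `missing_tools` is a hypothetical helper, and the tool names in the example call are taken from the install steps above (adjust the list to your setup).

```shell
# Hypothetical helper (not part of BIgFOOT): print each named tool that cannot
# be found on PATH; return non-zero if any are missing.
missing_tools() {
    local status=0 tool
    for tool in "$@"; do
        if ! command -v "${tool}" > /dev/null 2>&1; then
            echo "${tool}"
            status=1
        fi
    done
    return ${status}
}

# Example call; prints the name of any tool that was not found.
missing_tools vg samtools minimap2 seqkit odgi \
    || echo "Install the tools listed above or add them to PATH" >&2
```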
Running bigfoot - Example using sequencing/alignment files from IGSR: NA19240
Set up example directory, download relevant files, and then run BIgFOOT pipeline
conda activate bigfoot
(Change this if you've downloaded the github repo somewhere else/have the bigfoot analysis scripts saved elsewhere)
bigfoot_dir=${tools_dir}/BIgFOOT/scripts ; immunovar_bed=${bigfoot_source}/grch38_custom_immunovar_coords.bed
test_dir=${bigfoot_source}/example/ ; mkdir -p ${test_dir}; cd ${test_dir}
Illumina chemistry: V2, Array: Agilent Sure Select Whole exome capture 50 Mb
#fastq-dl -a SRR507323 -o ${test_dir}/
wget -P ${test_dir}/
wget -P ${test_dir}/
sample="SRR507323" workdir=${PWD} bigfoot_source=${bigfoot_source} bigfoot_dir=${bigfoot_dir} merged="FALSE" graph="wg_immunovar" valid_alleles=true
################################################################
. ${bigfoot_dir}/
wget -P ${test_dir}/
tools_dir=${tools_dir} PATH=${tools_dir}:$PATH
export bam_file="" workdir=${PWD} bigfoot_source=${bigfoot_source} bigfoot_dir=${bigfoot_dir} ref_build="grch38" ref="${bigfoot_source}/GRCh38_full_analysis_set_plus_decoy_hla.fa" tools_dir=${tools_dir} PATH=${tools_dir}:$PATH merged="FALSE" graph="wg_immunovar" valid_alleles=true
################################################################
. ${bigfoot_dir}/ > ${bam_file%.cram}.log
################################################################
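Note that `${bam_file%.cram}.log` only strips a `.cram` suffix, so a `.bam` input would produce a log named `<file>.bam.log`. If you prefer the same naming for both formats, one option is sketched below; `log_name_for` is a hypothetical helper and the file name in the example is made up.

```shell
# Hypothetical helper (not part of BIgFOOT): derive a log file name from an
# alignment path, stripping either a .cram or a .bam extension.
log_name_for() {
    local aln=$1
    case "${aln}" in
        *.cram) echo "${aln%.cram}.log" ;;
        *.bam)  echo "${aln%.bam}.log" ;;
        *)      echo "${aln}.log" ;;
    esac
}

log_name_for "NA19240.example.bam"   # hypothetical file name; prints NA19240.example.log
```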
Support for CHM13-based BAM/CRAM is planned
bazam_reads=${i}; sample_id=${bazam_reads%.bazam.fastq.gz};sample_id=${sample_id##*/};
Sequence-to-graph alignment using VG-giraffe
vg giraffe -i -f ${bazam_reads} -x ${graph_base}.xg -H ${graph_base}.gbwt -d ${graph_base}.dist -m ${graph_base}.min -p > ${sample_id}.bazam.grch38.wg.gam
vg giraffe -f ${sample_id}.unmapped.fastq.gz -x ${graph_base}.xg -H ${graph_base}.gbwt -d ${graph_base}.dist -m ${graph_base}.min -p > ${sample_id}.unmapped.grch38.wg.gam
cat ${sample_id}.bazam.grch38.wg.gam ${sample_id}.unmapped.grch38.wg.gam > ${sample_id}.bazam.grch38.combined.gam
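The `sample_id` used to name these GAM files comes from two shell parameter expansions applied to the input path. The worked example below shows what each one does; the input path is hypothetical and stands in for whatever `${i}` holds in your run.

```shell
# Hypothetical input path, for illustration only.
i="/data/run1/NA19240.bazam.fastq.gz"

bazam_reads=${i}
# '%' strips the shortest matching suffix: drop the ".bazam.fastq.gz" extension.
sample_id=${bazam_reads%.bazam.fastq.gz}
# '##*/' strips the longest prefix ending in "/": drop the directory part.
sample_id=${sample_id##*/}

echo "${sample_id}"   # prints: NA19240
```

If the input does not end in `.bazam.fastq.gz`, the first expansion leaves the name unchanged, so the extension would carry over into the output file names.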
Ready for BIgFOOT
export i=${sample_id}.bazam.grch38.combined.gam workdir=${PWD} graph=${graph} bigfoot_source=${bigfoot_source} bigfoot_dir=${bigfoot_dir} valid_alleles=true
################################################################
. ${bigfoot_dir}/
################################################################
This is still very much a work in progress - many parameters/options exist but have not been fully documented here, and this repo is under active development.
Please reach out if you feel this tool might be useful in your work, you have questions regarding its use, or if you'd like some added functionality - you can open an issue or email