BEASTIE (Bayesian Estimation of Allele-Specific Transcription Integrating across Exons) is a software suite for identifying allele-specific expression (ASE) from regulatory variants in RNA-seq and WGS data. BEASTIE uses a Bayesian hierarchical model, QuickBEAST, to integrate prior information with read counts and genetic data.
The BEASTIE workflow is designed for gene-level ASE estimation and has been shown to be more accurate than other binomial-test-based methods.
BEASTIE is free for academic and non-profit use.
The workflow is detailed in the 'Example Code Usage' section below.
If you prefer to run only the QuickBEAST model, visit the QuickBEAST GitHub page.
- sample.vcf.gz
- sample.bam.gz
- shapeit2 phasing file
- reference directory (GENCODE reference, allele-frequency reference extracted from the 1000 Genomes Project across different ancestries)
- optional testing data
Download testing data and/or reference files from the shared Google Drive folder:
- Example testing data: "NA12878_chr21"
- Zipped reference data: unzip it and set the environment variable $refdir to the directory where the reference folder has been unzipped.
https://drive.google.com/drive/folders/1z63jSyNWBKFJu4CZ4z54OFXHK8LsXu9O?usp=drive_link
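A minimal setup sketch, assuming both archives were downloaded into the current directory (the archive names below are illustrative placeholders):
# Unpack the example data and the reference bundle (archive names are illustrative)
unzip NA12878_chr21.zip
unzip reference.zip
# Point $refdir at the directory that now contains the unzipped reference folder
export refdir=$PWD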
Docker is the recommended way to run BEASTIE, as it avoids dependency conflicts and simplifies setup. Since Docker containers do not have access to the host system by default, you need to use the -v flag to mount your host directory within the container.
Preparation step: register for your unique LD token and replace it in the <config_file>.
- Pull the Docker Image:
% docker pull xuezou/beastie
- Run BEASTIE: Replace <config_file> with the path to your BEASTIE configuration file. This command assumes all your data and outputs are in the current directory.
% docker run -v "`pwd`":"`pwd`" xuezou/beastie -c <config_file>
If you're using a cluster with Singularity, follow these steps:
- Pull the Docker Image and Create a TAR File:
% docker pull xuezou/beastie
% docker save --output beastie.tar xuezou/beastie
- Copy the TAR File to Your Cluster: Use 'scp' or another transfer method to copy the 'beastie.tar' file to your cluster.
- Create a Singularity Image:
% singularity build beastie.sif docker-archive://beastie.tar
- Run BEASTIE with Singularity:
% singularity run --bind <directory> beastie.sif -c <config_file>
If you prefer not to use Docker, you can install BEASTIE in a Python virtual environment.
- Clone the Repository:
git clone https://github.com/x811zou/BEASTIE.git
cd BEASTIE
- Create a Python Virtual Environment:
python3 -m venv venv
source venv/bin/activate
- Install Dependencies:
pip install -r requirements.txt
- Install BEASTIE:
make install
This section provides step-by-step examples of running the BEASTIE pipeline using Docker. A wrapper script is also provided:
% sh run_BEASTIE.sh
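The individual steps below share a set of shell variables. Illustrative placeholder values (all paths and values here are examples, not required names) might look like:
# Example placeholder values for variables used in the steps below
working_dir=$PWD                        # host directory mounted into the container as /mnt
ref_dir=$refdir                         # directory containing the unzipped reference folder
input_vcfgz=$working_dir/sample.vcf.gz  # input VCF
input_bamgz=$working_dir/sample.bam.gz  # input BAM (gzipped)
chr_start=21                            # e.g. chr21 only, matching the NA12878_chr21 test data
chr_end=21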
- Clean your VCF file to retain only bi-allelic heterozygous sites.
docker run -v `pwd`:`pwd` -v $working_dir:/mnt xuezou/beastie \
cleanVCF \
--tmp-dir $tmp_vcf_dir \
--vcfgz-file $input_vcfgz \
--filtered-bi-vcfgz $output_bi_vcfgz \
--filtered-bihets-vcfgz $output_bihets_vcfgz
NOTES:
1. If the VCF does not have PASS in the FILTER field (column 7), use --skip-require-pass and set --quality-score-min (e.g. --quality-score-min 10) to specify the quality cutoff.
2. If the VCF has PASS in the FILTER field (column 7), the default --quality-score-min 0 applies and --skip-require-pass is not needed.
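For example, a VCF whose FILTER field is not set to PASS could be cleaned as follows (a sketch; the cutoff of 10 is only an example value):
docker run -v `pwd`:`pwd` -v $working_dir:/mnt xuezou/beastie \
cleanVCF \
--tmp-dir $tmp_vcf_dir \
--vcfgz-file $input_vcfgz \
--filtered-bi-vcfgz $output_bi_vcfgz \
--filtered-bihets-vcfgz $output_bihets_vcfgz \
--skip-require-pass \
--quality-score-min 10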
- Extract heterozygous sites from the GENCODE reference for samtools mpileup. (We provide a pre-split GENCODE v19 reference for all 22 chromosomes in the reference folder; users are free to use their own version of the GENCODE reference and split it with vcftools.)
- Parse pileup read counts with our faster Python script, originally adapted from ASEReadCounter.
- Thin reads by read length so that each read is counted only once.
- Annotate AF and LD for bi-allelic heterozygous SNP pairs.
docker run -v `pwd`:`pwd` -v $working_dir:/mnt xuezou/beastie \
extractHets \
--vcfgz-file $output_bihets_vcfgz \
--out-hetSNP $output_hetSNP \
--chr-start $chr_start \
--chr-end $chr_end \
--gencode-dir $ref_dir/reference/gencode_chr
docker run -v `pwd`:`pwd` -v $working_dir:/mnt xuezou/beastie \
pileupReads \
--tmp-dir $tmp_mpileup_dir \
--in-hetSNP $output_hetSNP \
--in-BAMGZ $input_bamgz \
--out-pileupGZ $output_pileup_gz \
--chr-start $chr_start \
--chr-end $chr_end \
--ref $ref_dir/reference/hg19/hg19.fa
- Pile up reads for each variant. The command used in this step is:
% samtools mpileup -d 0 -B -s -f $ref -l $het_sites_for_mpileup $bam > ${prefix}.pileup
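Briefly, the mpileup options used here are:
- -d 0: remove the per-file read depth cap
- -B: disable BAQ (base alignment quality) recomputation
- -s: include mapping quality in the output
- -f: the faidx-indexed reference FASTA ($ref)
- -l: restrict the pileup to the extracted heterozygous sites ($het_sites_for_mpileup)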
Simulate unbiased FASTQ reads for each allele of each gene.
docker run -v `pwd`:`pwd` -v $working_dir:/mnt xuezou/beastie \
simulateReads \
--tmp-dir $tmp_simulation_dir \
--vcfgz-file $output_bi_vcfgz \
--in-BAMGZ $input_bamgz \
--out-fwd-fastq $output_fwd_fastqreads \
--out-rev-fastq $output_rev_fastqreads \
--chr-start $chr_start \
--chr-end $chr_end \
--ref-genome-2bit $ref_dir/reference/hg19/hg19.2bit \
--ref-gff $ref_dir/reference/gencode.v19.annotation.level12.gtf \
--ref-twobit $ref_dir/reference/two_bit_linux
Required action: align the simulated fastq reads with the same alignment tool and parameters you used for your real data, then repeat step 2 (pileup) on the aligned BAM of the simulated reads, as sketched below.
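A sketch of this alignment, based on the STAR invocation shown in the alignment section at the end of this document; $simulated_prefix is a placeholder, and --readFilesCommand assumes the simulated fastq files are gzipped (drop it otherwise):
% STAR --twopassMode Basic --runThreadN 24 --genomeDir $star_ind \
--readFilesIn $output_fwd_fastqreads $output_rev_fastqreads \
--alignEndsType EndToEnd \
--outSAMtype BAM SortedByCoordinate \
--readFilesCommand "gunzip -c" \
--outFileNamePrefix $simulated_prefix
Then repeat step 2 (pileupReads) with --in-BAMGZ pointing at the aligned simulated BAM, and save the resulting pileup as $input_simulation_pileup_gz for use in the runModel step.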
Filter out heterozygous sites with genotyping errors.
docker run -v `pwd`:`pwd` -v $working_dir:/mnt xuezou/beastie \
filterGenotypingError \
--filtered-hetSNP-file $output_hetSNP_filtered \
--genotype-error-file $output_genotypeEr \
--vcfgz-file $output_bihets_vcfgz \
--pileupGZ-file $output_pileup_gz \
--input-hetSNP-file $output_hetSNP \
--ancestry $ancestry \
--read-length $read_length \
--chr-start $chr_start \
--chr-end $chr_end \
--af-dir $ref_dir/reference/AF \
--genotypeEr-cutoff $genotypeEr_cutoff \
--out-dir $tmp_genotypeEr_dir \
--warmup $WARMUP \
--keeper $KEEPER
- Convert data into the format required for model input
- Predict phasing error using a trained logistic regression model
- Update the model input with the predicted phasing error
- Run the BEASTIE model
- Generate a gene list with a user-defined p-value cutoff
(1): Using Phasing from VCF:
docker run -v `pwd`:`pwd` -v $working_dir:/mnt xuezou/beastie \
runModel \
--vcfgz-file $output_bihets_vcfgz \
--vcf-sample-name $sample_name_in_vcf \
--filtered-hetSNP-file $output_hetSNP_filtered \
--ancestry $ancestry \
--chr-start $chr_start \
--chr-end $chr_end \
--min-single-cov $min_single_count \
--min-total-cov $min_total_count \
--read-length $read_length \
--output-dir $output_dir/beastie/phased_even${read_length}_VCF \
--ldlink-cache-dir $base_dir \
--save-intermediate \
--alignBiasP-cutoff $binomialp_cutoff \
--ase-cutoff $FDR_cutoff \
--ld-token $LD_token \
--simulation-pileup-file $input_simulation_pileup_gz
(2): Using Phasing from Shapeit2 Files:
docker run -v `pwd`:`pwd` -v $working_dir:/mnt xuezou/beastie \
runModel \
--vcfgz-file $output_bihets_vcfgz \
--vcf-sample-name $sample_name_in_vcf \
--filtered-hetSNP-file $output_hetSNP_filtered \
--ancestry $ancestry \
--chr-start $chr_start \
--chr-end $chr_end \
--min-single-cov $min_single_count \
--min-total-cov $min_total_count \
--read-length $read_length \
--output-dir $output_dir/beastie/phased_even${read_length}_shapeit2 \
--ldlink-cache-dir $base_dir \
--save-intermediate \
--alignBiasP-cutoff $binomialp_cutoff \
--ase-cutoff $FDR_cutoff \
--ld-token $LD_token \
--simulation-pileup-file $input_simulation_pileup_gz \
--shapeit2-phasing-file $input_shapeit2
a. Trim raw RNA-seq fastq reads using Trimmomatic
# install at $trimmomatic_path
wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip
% java -jar $trimmomatic_path/trimmomatic-0.39.jar PE -threads 16 -phred33 $fastq_R1 $fastq_R2 \
$trimmed_fastq/${sample}_FWD_paired.fq.gz $trimmed_fastq/${sample}_FWD_unpaired.fq.gz \
$trimmed_fastq/${sample}_REV_paired.fq.gz $trimmed_fastq/${sample}_REV_unpaired.fq.gz \
ILLUMINACLIP:$trimmomatic_reference/trimmomatic_MHPS.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
The parameters are:
- $fastq_R1, $fastq_R2: raw paired-end fastq input files
- $trimmed_fastq: output path for the trimmed fastq files
- $trimmomatic_reference: directory containing the adapter fasta (trimmomatic_MHPS.fa)
- $sample: sample name or prefix for output
b. Align reads with STAR
We have done an extensive comparison of RNA-seq aligners with respect to reference-allele mapping bias, and found that the splice-aware aligner STAR, run in two-pass EndToEnd alignment mode with WASP filtering, gives high efficiency and minimal bias: https://github.com/alexdobin/STAR. If you already have aligned BAM files, you can use them directly as input.
% STAR --twopassMode Basic --runThreadN 24 --genomeDir $star_ind \
--readFilesIn $fastqDir/${sample}_FWD_paired.fq.gz $fastqDir/${sample}_REV_paired.fq.gz \
--alignEndsType EndToEnd \
--waspOutputMode SAMtag \
--varVCFfile $VCF \
--outFilterMismatchNmax 10 \
--outSAMtype BAM SortedByCoordinate \
--outReadsUnmapped Fastx \
--outSAMattributes NH HI NM MD AS nM jM jI XS vA vG vW \
--readFilesCommand "gunzip -c" \
--outFileNamePrefix $output_prefix
% java -jar $picardDir/picard.jar MarkDuplicates \
I=$output_prefix/Aligned.sortedByCoord.out.bam \
O=$output_prefix/Aligned.sortedByCoord.out.picard_markdup.bam \
M=$output_prefix/picard_markdup_metrics.txt  # duplication metrics output (required by MarkDuplicates)
The parameters are:
- $fastqDir: input path for trimmed fastq
- $star_ind: index for STAR aligner
- $VCF: VCF file for the sample; the chromosome column must use the 'chr' prefix.
- $sample: sample name or prefix for output
- $output_prefix: output path with output prefix for aligned BAM
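If a STAR index ($star_ind) is not already available, it can be generated along these lines (a sketch; --sjdbOverhang is typically read length minus 1, and the FASTA/GTF paths reuse the reference files from earlier steps):
% STAR --runMode genomeGenerate --runThreadN 24 \
--genomeDir $star_ind \
--genomeFastaFiles $ref_dir/reference/hg19/hg19.fa \
--sjdbGTFfile $ref_dir/reference/gencode.v19.annotation.level12.gtf \
--sjdbOverhang 99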
Please cite this paper when using BEASTIE in your publications.
Zou, X., Gomez, Z. W., Reddy, T. E., Allen, A. S., Majoros, W. H. (2024). Bayesian Estimation of Allele-Specific
Expression in the Presence of Phasing Uncertainty. bioRxiv, doi: 10.1101/2024.08.09.607371.
The bioinformatics pipeline was developed by Scarlett (Xue Zou). We thank Bill for his guidance on this project, and Zack and Tomas for their help in maintaining this software. For further information, inquiries, or contributions, please feel free to contact Xue Zou.