BEASTIE: A Bioinformatics Pipeline for Gene-level ASE Estimation

BEASTIE (Bayesian Estimation of Allele-Specific Transcription Integrating across Exons) is a software suite for identifying allele-specific expression (ASE) from regulatory variants in RNA-seq and WGS data. BEASTIE uses QuickBEAST, a Bayesian hierarchical model, to integrate prior information with read counts and genetic data.

The BEASTIE workflow is designed for gene-level ASE estimation and has been shown to be more accurate than other binomial distribution-based tests.

BEASTIE is free for academic and non-profit use.

Summary of workflow

(Workflow overview diagram)

The workflow is detailed in the 'Example code usage' section below.

QuickBEAST Model

If you prefer to run only the QuickBEAST model, visit the QuickBEAST GitHub page.


Input Preparation & Testing data

  • sample.vcf.gz
  • sample.bam.gz
  • shapeit2 phasing file
  • reference directory (gencode reference; allele-frequency reference extracted from the 1000 Genomes Project across different ancestries)
  • optional testing data
    Download testing data and/or reference files from the shared Google Drive folder below. Example testing data: "NA12878_chr21".
    Zipped reference data: unzip it and set the environment variable $ref_dir (used in the commands below) to the directory containing the unzipped reference folder; see the sketch below.
    https://drive.google.com/drive/folders/1z63jSyNWBKFJu4CZ4z54OFXHK8LsXu9O?usp=drive_link
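
A minimal setup sketch, assuming the downloaded reference archive is named reference.zip (the actual file name in the Drive folder may differ):

% unzip reference.zip -d /path/to/BEASTIE_reference
% export ref_dir=/path/to/BEASTIE_reference   # the commands below reference $ref_dir/reference/...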

Installation options


Using Docker Image locally

Docker is the recommended way to run BEASTIE, as it avoids dependency conflicts and simplifies setup. Since Docker containers do not have access to the host system by default, you need to use the -v flag to mount your host directory within the container.

Preparation step: get your unique LD token (Register LD token) and set it in your <config_file>.

  1. Pull the Docker Image:
% docker pull xuezou/beastie
  2. Run BEASTIE: Replace <config_file> with the path to your BEASTIE configuration file. This command assumes all your data and outputs are in the current directory.
% docker run -v "`pwd`":"`pwd`" xuezou/beastie -c <config_file>
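
For example, a hypothetical invocation that also mounts a separate data directory holding the files referenced by the config file (all paths are placeholders):

% docker run -v "`pwd`":"`pwd`" -v /data/NA12878:/data/NA12878 xuezou/beastie -c /data/NA12878/beastie_config.cfg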

Using Singularity on a cluster:

If you're using a cluster with Singularity, follow these steps:

  1. Pull the Docker Image and Create a TAR File:
% docker save --output beastie.tar xuezou/beastie
  2. Copy the TAR File to Your Cluster: Use 'scp' or another transfer method to copy the 'beastie.tar' file to your cluster (see the example after this list).
  3. Create a Singularity Image:
% singularity build -s beastie.sif docker-archive://beastie.tar
  4. Run BEASTIE with Singularity:
% singularity run --bind <directory> beastie.sif -c <config_file>
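
For reference, a hypothetical transfer command for step 2 (user, host, and path are placeholders):

% scp beastie.tar your_user@cluster.example.edu:/path/to/workdir/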

Installation on Local System (Without Docker)

If you prefer not to use Docker, you can install BEASTIE in a Python virtual environment.

  1. Clone the Repository:
% git clone https://github.com/x811zou/BEASTIE.git
% cd BEASTIE
  2. Create a Python Virtual Environment:
% python3 -m venv venv
% source venv/bin/activate
  3. Install Dependencies:
% pip install -r requirements.txt
  4. Install BEASTIE:
% make install

Example code usage (details in the run_BEASTIE.sh file):

This section provides step-by-step examples of running the BEASTIE pipeline using Docker.

sh run_BEASTIE.sh
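
The step commands below rely on shell variables for inputs, outputs, and settings; run_BEASTIE.sh defines the full set. A minimal, hypothetical sketch (every path and value here is a placeholder):

working_dir=/path/to/work            # host directory mounted into the container at /mnt
ref_dir=/path/to/BEASTIE_reference   # unzipped reference folder (see Input Preparation above)
input_vcfgz=$working_dir/sample.vcf.gz
input_bamgz=$working_dir/sample.bam.gz
chr_start=1
chr_end=22
read_length=100                      # read length of your RNA-seq data
ancestry=CEU                         # placeholder; must match an ancestry in the AF reference
LD_token=<your_LDlink_token>         # from the LD token registration step above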

step0: Clean VCF file.

  • Clean your VCF file to retain only bi-allelic heterozygous sites.
docker run -v `pwd`:`pwd` -v $working_dir:/mnt xuezou/beastie \
    cleanVCF \
    --tmp-dir $tmp_vcf_dir \
    --vcfgz-file $input_vcfgz \
    --filtered-bi-vcfgz $output_bi_vcfgz \
    --filtered-bihets-vcfgz $output_bihets_vcfgz

NOTES:
1. If your VCF does not have PASS in the FILTER field (column 7), use --skip-require-pass and --quality-score-min <cutoff> (e.g. 10) to specify a quality-score cutoff; see the sketch below.
2. If your VCF has PASS in the FILTER field, the default --quality-score-min 0 applies and --skip-require-pass is not needed.
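
For example, a sketch of the no-PASS case from note 1 (the cutoff of 10 is illustrative):

docker run -v `pwd`:`pwd` -v $working_dir:/mnt xuezou/beastie \
    cleanVCF \
    --tmp-dir $tmp_vcf_dir \
    --vcfgz-file $input_vcfgz \
    --filtered-bi-vcfgz $output_bi_vcfgz \
    --filtered-bihets-vcfgz $output_bihets_vcfgz \
    --skip-require-pass \
    --quality-score-min 10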

step1: Extract heterozygous sites from the gencode reference for the VCF file.

  • Extract heterozygous sites from the gencode reference for samtools mpileup (we provide pre-split gencode v19 for all 22 chromosomes in the reference folder; users are free to use their own version of the gencode reference and split it with vcftools).
  • Parse pileup read counts with our faster Python script, originally adapted from ASEReadCounter.
  • Thin reads by read length so that each read is counted only once.
  • Annotate AF and LD for pairs of bi-allelic heterozygous SNPs.
docker run -v `pwd`:`pwd` -v $working_dir:/mnt xuezou/beastie \
    extractHets \
    --vcfgz-file $output_bihets_vcfgz \
    --out-hetSNP $output_hetSNP \
    --chr-start $chr_start \
    --chr-end $chr_end \
    --gencode-dir $ref_dir/reference/gencode_chr 

step2: Parse Gene Coverage from BAM

docker run -v `pwd`:`pwd` -v $working_dir:/mnt xuezou/beastie \
    pileupReads \
    --tmp-dir $tmp_mpileup_dir \
    --in-hetSNP $output_hetSNP \
    --in-BAMGZ $input_bamgz \
    --out-pileupGZ $output_pileup_gz \
    --chr-start $chr_start \
    --chr-end $chr_end \
    --ref $ref_dir/reference/hg19/hg19.fa
  • Pile up reads at each variant site. The samtools command used in this step:
% samtools mpileup -d 0 -B -s -f $ref -l $het_sites_for_mpileup $bam > ${prefix}.pileup

step3: Simulate unbiased fastq reads for each allele of each gene.

docker run -v `pwd`:`pwd` -v $working_dir:/mnt xuezou/beastie \
    simulateReads \
    --tmp-dir $tmp_simulation_dir \
    --vcfgz-file $output_bi_vcfgz \
    --in-BAMGZ $input_bamgz \
    --out-fwd-fastq $output_fwd_fastqreads \
    --out-rev-fastq $output_rev_fastqreads \
    --chr-start $chr_start \
    --chr-end $chr_end \
    --ref-genome-2bit $ref_dir/reference/hg19/hg19.2bit \
    --ref-gff $ref_dir/reference/gencode.v19.annotation.level12.gtf \
    --ref-twobit $ref_dir/reference/two_bit_linux

Required action: align the simulated fastq reads with the same alignment tool and parameters you used for your real data, then rerun step2 to generate a pileup from the aligned BAM of simulated reads (see the sketch below).
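
A minimal sketch of that re-pileup step, assuming the aligned simulated reads have been packaged as $simulated_bamgz (a hypothetical variable name) and writing the pileup that step5 consumes via --simulation-pileup-file:

docker run -v `pwd`:`pwd` -v $working_dir:/mnt xuezou/beastie \
    pileupReads \
    --tmp-dir $tmp_mpileup_dir \
    --in-hetSNP $output_hetSNP \
    --in-BAMGZ $simulated_bamgz \
    --out-pileupGZ $input_simulation_pileup_gz \
    --chr-start $chr_start \
    --chr-end $chr_end \
    --ref $ref_dir/reference/hg19/hg19.fa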

step4: Filter Genotyping Errors

Filter out heterozygous sites with genotyping errors.

docker run -v `pwd`:`pwd` -v $working_dir:/mnt xuezou/beastie \
    filterGenotypingError \
    --filtered-hetSNP-file $output_hetSNP_filtered \
    --genotype-error-file $output_genotypeEr \
    --vcfgz-file $output_bihets_vcfgz \
    --pileupGZ-file $output_pileup_gz \
    --input-hetSNP-file $output_hetSNP \
    --ancestry $ancestry \
    --read-length $read_length \
    --chr-start $chr_start \
    --chr-end $chr_end \
    --af-dir $ref_dir/reference/AF \
    --genotypeEr-cutoff $genotypeEr_cutoff \
    --out-dir $tmp_genotypeEr_dir \
    --warmup $WARMUP \
    --keeper $KEEPER 

step5: run BEASTIE

  • Convert the data into the format required for model input
  • Predict phasing error using a trained logistic regression model
  • Update the model input with the predicted phasing error
  • Run the BEASTIE model
  • Generate a gene list with a user-defined p-value cutoff

(1): Using Phasing from VCF:

docker run -v `pwd`:`pwd` -v $working_dir:/mnt xuezou/beastie \
    runModel \
    --vcfgz-file $bihets_vcfgz \
    --vcf-sample-name $sample_name_in_vcf \
    --filtered-hetSNP-file $output_hetSNP_filtered \
    --ancestry $ancestry \
    --chr-start $chr_start \
    --chr-end $chr_end \
    --min-single-cov $min_single_count \
    --min-total-cov $min_total_count \
    --read-length $read_length \
    --output-dir $output_dir/beastie/phased_even${read_length}_VCF \
    --ldlink-cache-dir $base_dir \
    --save-intermediate \
    --alignBiasP-cutoff $binomialp_cutoff \
    --ase-cutoff $FDR_cutoff \
    --ld-token $LD_token \
    --simulation-pileup-file $input_simulation_pileup_gz

(2): Using Phasing from Shapeit2 Files:

docker run -v `pwd`:`pwd` -v $working_dir:/mnt xuezou/beastie \
    runModel \
    --vcfgz-file $bihets_vcfgz \
    --vcf-sample-name $sample_name_in_vcf \
    --filtered-hetSNP-file $out_hetSNP_filtered \
    --ancestry $ancestry \
    --chr-start $chr_start \
    --chr-end $chr_end \
    --min-single-cov $min_single_count \
    --min-total-cov $min_total_count \
    --read-length $read_length \
    --output-dir $out_dir/beastie/phased_even${read_length}_shapeit2 \
    --ldlink-cache-dir $base_dir \
    --save-intermediate \
    --alignBiasP-cutoff $binomialp_cutoff \
    --ase-cutoff $FDR_cutoff \
    --ld-token $LD_token \
    --simulation-pileup-file $input_simulation_pileup_gz \
    --shapeit2-phasing-file $input_shapeit2

Recommended preparation step: process raw fastq data

a. Trim raw RNA-seq fastq reads using Trimmomatic

# install at $trimmomatic_path
wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip

% java -jar $trimmomatic_path/trimmomatic-0.39.jar PE -threads 16 -phred33 $fastq_R1 $fastq_R2 \
     $trimmed_fastq/${sample}_FWD_paired.fq.gz $trimmed_fastq/${sample}_FWD_unpaired.fq.gz \
     $trimmed_fastq/${sample}_REV_paired.fq.gz $trimmed_fastq/${sample}_REV_unpaired.fq.gz \
     ILLUMINACLIP:$trimmomatic_reference/trimmomatic_MHPS.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

The parameters are:

  • $trimmed_fastq: output path for trimmed fastq files
  • $sample: sample name or prefix for output

b. Align reads with STAR
We have done extensive comparisons of RNA-seq aligners with respect to reference-allele mapping bias and found that the splice-aware aligner STAR, run in two-pass mode with EndToEnd alignment and WASP filtering, gives high efficiency and minimal bias: https://github.com/alexdobin/STAR. If you already have aligned BAM files, you can use them directly as input.

% STAR --twopassMode Basic --runThreadN 24 --genomeDir $star_ind \
      --readFilesIn $fastqDir/${sample}_FWD_paired.fq.gz $fastqDir/${sample}_REV_paired.fq.gz \
      --alignEndsType EndToEnd \
      --waspOutputMode SAMtag \
      --varVCFfile $VCF \
      --outFilterMismatchNmax 10 \
      --outSAMtype BAM SortedByCoordinate \
      --outReadsUnmapped Fastx \
      --outSAMattributes NH HI NM MD AS nM jM jI XS vA vG vW \
      --readFilesCommand "gunzip -c" \
      --outFileNamePrefix $output_prefix

% java -jar $picardDir/picard.jar MarkDuplicates \
     I=$output_prefix/Aligned.sortedByCoord.out.bam \
     O=$output_prefix/Aligned.sortedByCoord.out.picard_markdup.bam

The parameters are:

  • $fastqDir: input path for trimmed fastq
  • $star_ind: index for STAR aligner
  • $VCF: VCF file for the sample; chromosome names must carry the 'chr' prefix.
  • $sample: sample name or prefix for output
  • $output_prefix: output path with output prefix for aligned BAM
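
As a follow-up sketch, producing the sample.bam.gz listed under 'Input Preparation & Testing data' from the deduplicated alignment (the gzip packaging is an assumption based on that file naming; paths reuse the variables above):

% samtools index $output_prefix/Aligned.sortedByCoord.out.picard_markdup.bam
% gzip -c $output_prefix/Aligned.sortedByCoord.out.picard_markdup.bam > $working_dir/sample.bam.gz   # assumption: --in-BAMGZ expects a gzip-wrapped BAM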

Citing:

Please cite the following paper when using BEASTIE in your publications.

Zou, X., Gomez, Z. W., Reddy, T. E., Allen, A. S., Majoros, W. H. (2024). Bayesian Estimation of Allele-Specific
Expression in the Presence of Phasing Uncertainty. bioRxiv, doi: 10.1101/2024.08.09.607371.

Acknowledgements:

The bioinformatics pipeline was developed by Scarlett (Xue Zou). We thank Bill for his guidance on this project, and Zack and Tomas for their help in maintaining this software. For further information, inquiries, or contributions, please feel free to contact Xue Zou.
