All scripts are located in the `snp-calling` directory.

- Input files were downloaded from CyVerse using the `getFiles.sh` script. The large files (run without barcodes) were split into smaller chunks for faster processing using the `split-fastq.sh` script.
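
  A minimal sketch of the kind of chunking `split-fastq.sh` might do (the chunk size is illustrative; fastq records are 4 lines each, so the line count per chunk must be a multiple of 4):

  ```bash
  # Split one large fastq into numbered ~1M-read chunks (4,000,000 lines each).
  split -l 4000000 -d --additional-suffix=.fastq \
      sample_R1.fastq sample_R1.part_
  ```
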
- The genome file (B73.v5) was downloaded from CyVerse and processed using `gatk-prepare-reference.sh` to create all the files needed to run the GATK pipeline.
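
  This step most likely builds the standard companion files that `bwa` and GATK expect next to the fasta; a sketch (the fasta file name is illustrative):

  ```bash
  ref=B73.PLATINUM.pseudomolecules-v1.fasta
  bwa index "$ref"                      # bwa mem index files
  samtools faidx "$ref"                 # .fai index
  picard CreateSequenceDictionary \
      R="$ref" O="${ref%.fasta}.dict"   # sequence dictionary for GATK
  ```
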
- The fastq files were mapped to B73.v5 and processed using the `process-fastq.sh` script. Briefly, this script (condensed in the sketch after this list):
  - converts unmapped fastq to bam with Picard `FastqToSam`
  - runs Picard `MarkIlluminaAdapters`
  - converts the bam back to fastq with `SamToFastq`
  - maps the fastq files to B73.v5 using `bwa mem` and converts the output to bam using `samtools`
  - merges the unmapped reads with the mapped reads using `MergeBamAlignment`
  - runs Picard's `MarkDuplicates` to mark optical duplicates
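
  A condensed sketch of those steps for a single paired-end sample (file names and thread counts are illustrative):

  ```bash
  ref=B73.PLATINUM.pseudomolecules-v1.fasta

  # fastq -> unmapped bam, tagging the sample name
  picard FastqToSam F1=sample_R1.fastq F2=sample_R2.fastq \
      O=sample.unmapped.bam SM=sample
  # mark adapter sequence so it can be clipped downstream
  picard MarkIlluminaAdapters I=sample.unmapped.bam \
      O=sample.markadapters.bam M=sample.adapter_metrics.txt
  # bam -> interleaved fastq -> bwa mem -> bam
  picard SamToFastq I=sample.markadapters.bam FASTQ=/dev/stdout \
      CLIPPING_ATTRIBUTE=XT CLIPPING_ACTION=2 INTERLEAVE=true |
    bwa mem -M -p -t 12 "$ref" /dev/stdin |
    samtools view -b -o sample.mapped.bam -
  # restore metadata from the unmapped bam
  picard MergeBamAlignment R="$ref" UNMAPPED_BAM=sample.unmapped.bam \
      ALIGNED_BAM=sample.mapped.bam O=sample.merged.bam
  # mark (optical) duplicates
  picard MarkDuplicates I=sample.merged.bam O=sample.dedup.bam \
      M=sample.dup_metrics.txt
  ```
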
- As a final step of processing, correct read groups were added to the BAM files using `run-add-readgroups.sh`, and the files were indexed.
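
  This is probably a thin wrapper around Picard; a sketch with illustrative read-group values:

  ```bash
  picard AddOrReplaceReadGroups I=sample.dedup.bam O=sample.final.bam \
      RGID=flowcell1.lane1 RGLB=lib1 RGPL=ILLUMINA \
      RGPU=flowcell1.lane1.sample RGSM=sample
  samtools index sample.final.bam
  ```
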
- GATK was run on 1 Mb intervals using the script `gatkcmds-round-1.sh` and the intervals file `B73.PLATINUM.pseudomolecules-v1_1mb_coords.bed`; the commands were generated and then run on the cluster via a SLURM job submission script built with GNU parallel (sketched below).
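
  A hypothetical sketch of how the per-interval commands could be generated and run (BED coordinates are 0-based half-open, hence the `start + 1`):

  ```bash
  ref=B73.PLATINUM.pseudomolecules-v1.fasta
  # one HaplotypeCaller command per 1 Mb interval
  while read -r chr start end; do
    echo "gatk HaplotypeCaller -R $ref -I sample.final.bam" \
         "-L ${chr}:$((start + 1))-${end} -O round1_${chr}_${start}.vcf.gz"
  done < B73.PLATINUM.pseudomolecules-v1_1mb_coords.bed > round1.cmds
  # inside the SLURM submission script:
  parallel --jobs "${SLURM_CPUS_ON_NODE:-16}" < round1.cmds
  ```
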
- Once the VCF files were generated (2,813 in total), they were gathered and filtered to retain only very high-quality SNPs, using the `gatk-process.sh` script.
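
  One plausible shape for this gather-and-filter step (thresholds are illustrative, loosely following GATK's hard-filtering recommendations for SNPs):

  ```bash
  ref=B73.PLATINUM.pseudomolecules-v1.fasta
  ls round1_*.vcf.gz > round1.vcfs.list
  gatk MergeVcfs -I round1.vcfs.list -O round1.all.vcf.gz
  gatk SelectVariants -R "$ref" -V round1.all.vcf.gz \
      --select-type-to-include SNP -O round1.snps.vcf.gz
  gatk VariantFiltration -R "$ref" -V round1.snps.vcf.gz \
      --filter-expression "QD < 2.0 || FS > 60.0 || MQ < 40.0" \
      --filter-name "lowqual" -O round1.snps.filtered.vcf.gz
  ```
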
- The BAM files were recalibrated against the filtered first-round SNPs using `gatk-bsqr.sh`.
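
  The recalibration presumably follows the standard two-command BQSR pattern (file names are illustrative):

  ```bash
  ref=B73.PLATINUM.pseudomolecules-v1.fasta
  gatk BaseRecalibrator -R "$ref" -I sample.final.bam \
      --known-sites round1.snps.filtered.vcf.gz -O sample.recal.table
  gatk ApplyBQSR -R "$ref" -I sample.final.bam \
      --bqsr-recal-file sample.recal.table -O sample.recal.bam
  ```
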
- Using the recalibrated BAM files, GATK was run again on 1 Mb intervals with the script `gatkcmds-round-2.sh` and the same intervals file `B73.PLATINUM.pseudomolecules-v1_1mb_coords.bed`; as in round 1, the commands were generated and run on the cluster via a SLURM job submission script built with GNU parallel.
- The final files were filtered again using the `gatk-process.sh` script.
- The final files were uploaded to CyVerse.
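
  CyVerse transfers are typically done with iRODS iCommands; a hypothetical upload (the destination path is illustrative):

  ```bash
  iput -r -P final-vcfs/ /iplant/home/username/snp-calling/
  ```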