release SQL for gnomAD v3.1.2 (#18)
* Update config.yml

* Update config.yml

* Update environment.yaml

* Update environment.yaml

* Update environment.yaml

* Create download_vcf_gnomad.sh

* Create README.md

* Update README.md

* update columns and define vcf file size

* Update README.md

* Update setup.py

Co-authored-by: Kalin Nonchev <[email protected]>
KalinNonchev and Kalin Nonchev authored Jul 13, 2022
1 parent cb00ce5 commit db2bb5a
Showing 7 changed files with 112 additions and 55 deletions.
6 changes: 3 additions & 3 deletions .circleci/config.yml
@@ -18,17 +18,17 @@ workflows:
# Inside the workflow, you define the jobs you want to run.
# For more details on extending your workflow, see the configuration docs: https://circleci.com/docs/2.0/configuration-reference/#workflows
jobs:
#- build-and-test
- build-and-test
- test_pip_install


jobs:
build-and-test: # This is the name of the job, feel free to change it to better match what you're trying to do!
docker:
- image: continuumio/miniconda3:4.7.10
- image: continuumio/miniconda3
steps:
- checkout
- *update_conda
#- *update_conda
- *create_env
- run:
name: Run tests
50 changes: 14 additions & 36 deletions README.md
@@ -1,6 +1,12 @@
# gnomAD_DB

### NEW version (December 2021)
### Changelog

#### NEW version (July 2022)
- release gnomAD WGS v3.1.2
- minor bug fixes

#### version (December 2021)
- more variant features available, check [here](https://github.com/KalinNonchev/gnomAD_DB/blob/master/gnomad_db/pkgdata/gnomad_columns.yaml)
- `get_maf_from_df` renamed to `get_info_from_df`
- `get_maf_from_str` renamed to `get_info_from_str`
@@ -9,24 +15,24 @@

[The Genome Aggregation Database (gnomAD)](https://gnomad.broadinstitute.org) is a resource developed by an international coalition of investigators, with the goal of aggregating and harmonizing both exome and genome sequencing data from a wide variety of large-scale sequencing projects, and making summary data available for the wider scientific community.

This package scales the huge gnomAD files (on average ~120G/chrom) to a SQLite database with a size of 34G for WGS v2.1.1 (261.942.336 variants) and 99G for WGS v3.1.1 (about 759.302.267 variants), and allows scientists to look for various variant annotations present in gnomAD (i.e. Allele Count, Depth, Minor Allele Frequency, etc. - [here](https://github.com/KalinNonchev/gnomAD_DB/blob/master/gnomad_db/pkgdata/gnomad_columns.yaml) you can find all selected features given the genome version). (A query containing 300.000 variants takes ~40s.)
This package scales the huge gnomAD files (on average ~120G/chrom) to a SQLite database with a size of 34G for WGS v2.1.1 (261.942.336 variants) and 98G for WGS v3.1.2 (about 759.302.267 variants), and allows scientists to look for various variant annotations present in gnomAD (i.e. Allele Count, Depth, Minor Allele Frequency, etc. - [here](https://github.com/KalinNonchev/gnomAD_DB/blob/master/gnomad_db/pkgdata/gnomad_columns.yaml) you can find all selected features given the genome version). (A query containing 300.000 variants takes ~40s.)

It extracts from a gnomAD vcf about 23 variant annotations. You can find further information about the exact fields [here](https://github.com/KalinNonchev/gnomAD_DB/blob/master/gnomad_db/pkgdata/gnomad_columns.yaml).
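To make the batch lookup described above concrete, here is a minimal sketch (not part of this diff; the constructor argument, the `chrom`/`pos`/`ref`/`alt` column names, and the exact `get_info_from_df` signature are assumptions based on the renamed methods listed in the changelog):

```python
import pandas as pd
from gnomad_db.database import gnomAD_DB

# assumed constructor: point the connection at the directory holding the sqlite3 file
db = gnomAD_DB("test_dir")

# assumed input layout: one row per variant
variants = pd.DataFrame({
    "chrom": ["1", "21"],
    "pos": [55516888, 9825790],
    "ref": ["G", "C"],
    "alt": ["GA", "T"],
})

# look up e.g. the allele frequency column for every variant in the frame
info = db.get_info_from_df(variants, "AF")
print(info)
```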

###### The package works for all currently available gnomAD releases. (January 2022)
###### The package works for all currently available gnomAD releases. (July 2022)

## 1. Download SQLite preprocessed files

I have preprocessed and created sqlite3 files for gnomAD v2.1.1 and 3.1.1 for you, which can be easily downloaded from here. They contain all variants on the 24 standard chromosomes.
I have preprocessed and created sqlite3 files for gnomAD v2.1.1 and 3.1.2 for you, which can be easily downloaded from here. They contain all variants on the 24 standard chromosomes.

gnomAD v3.1.1 (hg38, **759'302'267** variants) 46.9G zipped, 99G in total - https://zenodo.org/record/5758663/files/gnomad_db_v3.1.1.sqlite3.gz?download=1 \
gnomAD v3.1.2 (hg38, **759'302'267** variants) 46.2G zipped, 98G in total - https://zenodo.org/record/6818606/files/gnomad_db_v3.1.2.sqlite3.gz?download=1 \
gnomAD v2.1.1 (hg19, **261'942'336** variants) 16.1G zipped, 48G in total - https://zenodo.org/record/5770384/files/gnomad_db_v2.1.1.sqlite3.gz?download=1

You can download it as:

```python
from gnomad_db.database import gnomAD_DB
download_link = "https://zenodo.org/record/5770384/files/gnomad_db_v2.1.1.sqlite3.gz?download=1"
download_link = "https://zenodo.org/record/6818606/files/gnomad_db_v3.1.2.sqlite3.gz?download=1"
output_dir = "test_dir" # database_location
gnomAD_DB.download_and_unzip(download_link, output_dir)
```
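A possible follow-up step, continuing from the snippet above and sketched under the same assumptions (the constructor argument and the `"chrom:pos:ref>alt"` variant-string format are not confirmed by this diff):

```python
# connect to the database that was just unzipped into output_dir
db = gnomAD_DB(output_dir)

# single-variant lookup via the renamed get_info_from_str helper;
# the variant-string format used below is an assumption
af = db.get_info_from_str("21:9825790:C>T", "AF")
print(af)
```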
@@ -35,35 +41,6 @@ gnomAD_DB.download_and_unzip(download_link, output_dir)

or you can create the database yourself. **However, I recommend using the preprocessed files to save resources and time**. If you do so, you can go to **2. API usage** and explore the package and its great features!

## 1.1 Data preprocessing and SQL database creation

Start by downloading the vcf files from gnomAD into a single directory:

```bash
wget -c link_to_gnomAD.vcf.bgz
```

After that, specify the arguments in ```script_config.yaml```.
```
database_location: "test_out" # where to create the database; make sure you have enough space on your device
gnomad_vcf_location: "data" # where your *.vcf.bgz files are located
tables_location: "test_out" # where to store the preprocessed intermediate files; you can leave it like this
script_locations: "test_out" # where to store the scripts (you can check job progress here); you can leave it like this
genome: "Grch37" # genome version of the gnomAD vcf file (2.1.1 = Grch37, 3.1.1 = Grch38)
```

Once this is done, run
```bash
conda env create -f environment.yaml
conda activate gnomad_db
python -m ipykernel install --user --name gnomad_db --display-name "gnomad_db"
```
to prepare your conda environment.

Finally, you can trigger the snakemake pipeline, which will create the SQL database:
```bash
snakemake --cores 12
```

## 2. API usage

@@ -82,7 +59,8 @@ import pandas as pd
from gnomad_db.database import gnomAD_DB
```

2. Initialize database connection
2. Initialize database connection \
**Make sure to have the correct genome version!**
```python
# pass dir
database_location = "test_dir"
```
6 changes: 3 additions & 3 deletions environment.yaml
@@ -1,13 +1,13 @@
name: gnomad_db

channels:
- anaconda
- conda-forge
- bioconda

dependencies:
- python>=3.7
- python>=3.6
- pip
- datrie
- numpy>=1.19
- pandas>=1.1.4
- pyyaml
@@ -22,4 +22,4 @@ dependencies:
- joblib
- pytest
- nbformat>=5.1
- joblib
- joblib
24 changes: 12 additions & 12 deletions gnomad_db/pkgdata/gnomad_columns.yaml
@@ -17,12 +17,12 @@ Grch37:
- AC_popmax # Allele count in the population with the maximum AF
- AN_popmax # Total number of alleles in the population with the maximum AF
- AF_popmax # Maximum allele frequency across populations (excluding samples of Ashkenazi
- AF_eas
- AF_oth
- AF_nfe
- AF_fin
- AF_afr
- AF_asj
- AF_eas # Alternate allele frequency in samples of East Asian ancestry
- AF_oth # Alternate allele frequency in samples of Other ancestry
- AF_nfe # Alternate allele frequency in samples of Non-Finnish European ancestry
- AF_fin # Alternate allele frequency in samples of Finnish ancestry
- AF_afr # Alternate allele frequency in samples of African/African-American ancestry
- AF_asj # Alternate allele frequency in samples of Ashkenazi Jewish ancestry
Grch38:
- AC # Alternate allele count for samples
- AN # Total number of alleles in samples
@@ -38,9 +38,9 @@ Grch38:
- AC_popmax # Allele count in the population with the maximum AF
- AN_popmax # Total number of alleles in the population with the maximum AF
- AF_popmax # Maximum allele frequency across populations (excluding samples of Ashkenazi
- AF_eas
- AF_oth
- AF_nfe
- AF_fin
- AF_afr
- AF_asj
- AF_eas # Alternate allele frequency in samples of East Asian ancestry
# - AF_oth # Alternate allele frequency in samples of Other ancestry # not supported anymore 9.07.22
- AF_nfe # Alternate allele frequency in samples of Non-Finnish European ancestry
- AF_fin # Alternate allele frequency in samples of Finnish ancestry
- AF_afr # Alternate allele frequency in samples of African/African-American ancestry
- AF_asj # Alternate allele frequency in samples of Ashkenazi Jewish ancestry
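For orientation, a small sketch of how a columns file with this layout could be read (illustrative only; this is not the package's actual loader, the file path is assumed, and it assumes the top-level yaml keys are the genome builds as the diff suggests):

```python
# illustrative sketch: read the per-genome column lists from a yaml file laid out as above
import yaml

with open("gnomad_db/pkgdata/gnomad_columns.yaml") as fh:
    columns = yaml.safe_load(fh)

# e.g. the Grch38 entry would be a plain list of INFO field names such as AC, AN, AF, AF_popmax, ...
grch38_fields = columns["Grch38"]
print(grch38_fields)
```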
29 changes: 29 additions & 0 deletions scripts/README.md
@@ -0,0 +1,29 @@
## Data preprocessing and SQL database creation

Start by downloading the vcf files from gnomAD into a single directory:

```bash
wget -c link_to_gnomAD.vcf.bgz
```

After that, specify the arguments in ```script_config.yaml```.
```
database_location: "test_out" # where to create the database; make sure you have enough space on your device
gnomad_vcf_location: "data" # where your *.vcf.bgz files are located
tables_location: "test_out" # where to store the preprocessed intermediate files; you can leave it like this
script_locations: "test_out" # where to store the scripts (you can check job progress here); you can leave it like this
genome: "Grch37" # genome version of the gnomAD vcf file (2.1.1 = Grch37, 3.1.1 = Grch38)
```

Once this is done, run
```bash
conda env create -f environment.yaml
conda activate gnomad_db
python -m ipykernel install --user --name gnomad_db --display-name "gnomad_db"
```
to prepare your conda environment.

Finally, you can trigger the snakemake pipeline, which will create the SQL database:
```bash
snakemake --cores 12
```
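Optionally, the pipeline can be previewed before the full run using Snakemake's standard dry-run flag (a usage note, not part of the committed README):

```bash
# list the jobs the pipeline would execute, without running anything
snakemake -n --cores 12
```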
50 changes: 50 additions & 0 deletions scripts/download_vcf_gnomad.sh
@@ -0,0 +1,50 @@
### gnomAD v3.1.2: download VCFs in parallel; ~2.3T in total
# chr 1
wget -c https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr1.vcf.bgz &
# chr 2
wget -c https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr2.vcf.bgz &
# chr 3
wget -c https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr3.vcf.bgz &
# chr 4
wget -c https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr4.vcf.bgz &
# chr 5
wget -c https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr5.vcf.bgz &
# chr 6
wget -c https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr6.vcf.bgz &
# chr 7
wget -c https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr7.vcf.bgz &
# chr 8
wget -c https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr8.vcf.bgz &
# chr 9
wget -c https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr9.vcf.bgz &
# chr 10
wget -c https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr10.vcf.bgz &
# chr 11
wget -c https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr11.vcf.bgz &
# chr 12
wget -c https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr12.vcf.bgz &
# chr 13
wget -c https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr13.vcf.bgz &
# chr 14
wget -c https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr14.vcf.bgz &
# chr 15
wget -c https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr15.vcf.bgz &
# chr 16
wget -c https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr16.vcf.bgz &
# chr 17
wget -c https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr17.vcf.bgz &
# chr 18
wget -c https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr18.vcf.bgz &
# chr 19
wget -c https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr19.vcf.bgz &
# chr 20
wget -c https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr20.vcf.bgz &
# chr 21
wget -c https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr21.vcf.bgz &
# chr 22
wget -c https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chr22.vcf.bgz &
# chr x
wget -c https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chrX.vcf.bgz &
# chr y
wget -c https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chrY.vcf.bgz &
wait
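The same downloads can be written more compactly; a sketch equivalent to the per-chromosome calls above (same URLs, still run in parallel and joined with `wait`):

```bash
# compact equivalent of the per-chromosome wget calls above
base="https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes"
for chrom in {1..22} X Y; do
  wget -c "${base}/gnomad.genomes.v3.1.2.sites.chr${chrom}.vcf.bgz" &
done
wait
```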
2 changes: 1 addition & 1 deletion setup.py
@@ -1,7 +1,7 @@
from setuptools import setup, find_packages

setup(name='gnomad_db',
version='0.1.1',
version='0.1.2',
description='This package scales the huge gnomAD files to a SQLite database, which is easy and fast to query. It extracts from a gnomAD vcf the minor allele frequency for each variant.',
author='KalinNonchev',
author_email='[email protected]',
