Update README.md

KalinNonchev · May 9, 2021 · b0b800f
1 parent 7657013
commit b0b800f
Showing 1 changed file with 27 additions and 10 deletions.
diff --git a/README.md b/README.md
@@ -1,11 +1,11 @@
 # gnomAD_MAF
-The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators, with the goal of aggregating and harmonizing both exome and genome sequencing data from a wide variety of large-scale sequencing projects, and making summary data available for the wider scientific community.
+[The Genome Aggregation Database (gnomAD)](https://gnomad.broadinstitute.org) is a resource developed by an international coalition of investigators, with the goal of aggregating and harmonizing both exome and genome sequencing data from a wide variety of large-scale sequencing projects, and making summary data available for the wider scientific community.
 
-This package scales the huge gnomAD files (on average ~120G/chrom) to a SQLite database with a size of 56G for WGS v3.1.1. 
+This package scales the huge gnomAD files (on average ~120G/chrom) down to a SQLite database with a size of 56G for WGS v3.1.1 (about 760,000,000 variants), and lets scientists look up minor allele frequencies of variants quickly (a query containing 300,000 variants takes ~40s).
 
 It extracts from a gnomAD vcf the ["AF", "AF_afr", "AF_eas", "AF_fin", "AF_nfe", "AF_asj", "AF_oth", "AF_popmax"] columns. 
 
-It works for all currently available gnomAD releases. 
+###### The package works for all gnomAD releases available as of 2021.
 
 ## 1. Data preprocessing and SQL database creation
 
@@ -23,7 +23,7 @@ tables_location: "test_out" # where to store the preprocessed intermediate files
 script_locations: "test_out" # where to store the scripts, where you can check the progress of your jobs, you can leave it like this
 ```
 
-Ones, this is done, run
+Once this is done, run
 ```bash
 conda env create -f environment.yaml
 conda activate gnomad_db
@@ -46,24 +46,30 @@ pip install gnomad_db
 ```
 
 You can use the package like this:
+
+1. Import modules
 ```python
 import pandas as pd
 from gnomad_db.database import gnomAD_DB
+```
 
+2. Initialize database connection
+```python
 # pass dir
 database_location = "test_dir"
 db = gnomAD_DB(database_location)
+```
 
+3. Insert some test variants to run the examples below
+```python
 # get some variants
 var_df = pd.read_csv("data/test_vcf_gnomad_chr21_10000.tsv.gz", sep="\t", names=db.columns, index_col=False)
-# preprocess missing values
 # IMPORTANT: The database internally removes the chr prefix (chr1->1)
-var_df = var_df.replace(".", np.NaN)
-
 # insert these variants
 db.insert_variants(var_df)
 ```
 
+4. Query variant minor allele frequency
 ```python
 # query some MAF scores
 dummy_var_df = pd.DataFrame({
@@ -72,13 +78,24 @@ dummy_var_df = pd.DataFrame({
     "ref": ["T", "C"], 
     "alt": ["G", "T"]})
 
-# query from dataframe
-db.get_maf_from_df(dummy_var_df, "AF").head()
+# query the AF column for the variants in the dataframe
+db.get_maf_from_df(dummy_var_df, "AF")
+
+# query the AF and AF_popmax columns for the variants in the dataframe
+db.get_maf_from_df(dummy_var_df, "AF, AF_popmax")
+
+# query all columns for the variants in the dataframe
+db.get_maf_from_df(dummy_var_df, "*")
 
 # query from string
 db.get_maf_from_str("21:9825790:C>T", "AF")
+```
 
-
+5. You can also query minor allele frequencies over an interval
+```python
+db.get_mafs_for_interval(chrom=21, interval_start=9825780, interval_end=9825799, query="AF")
 ```
 
 For more information on how to use the package, look into the GettingStartedwithGnomAD_DB.ipynb notebook!
+
+#### NB: The package is under development; use-case suggestions, extensions, and feedback are welcome.
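
The README queries variants with strings such as `21:9825790:C>T` and notes that the database strips the `chr` prefix internally. The sketch below shows one way such a string could be normalized and looked up in SQLite. It is not the package's implementation: `parse_variant`, `lookup_af`, the `variants` table layout, and the stored AF value are all hypothetical and exist only for illustration.

```python
import sqlite3
from typing import Optional, Tuple


def parse_variant(variant: str) -> Tuple[str, int, str, str]:
    """Split 'chrom:pos:ref>alt' into parts and drop a leading 'chr' (hypothetical helper)."""
    chrom, pos, alleles = variant.split(":")
    ref, alt = alleles.split(">")
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]  # chr21 -> 21, mirroring the normalization the README describes
    return chrom, int(pos), ref, alt


# Toy in-memory table; the real database schema is an assumption and may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variants ("
    "chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, AF REAL, "
    "PRIMARY KEY (chrom, pos, ref, alt))"
)
# Made-up allele frequency, used only to have a row to query.
conn.execute("INSERT INTO variants VALUES ('21', 9825790, 'C', 'T', 0.25)")


def lookup_af(connection: sqlite3.Connection, variant: str) -> Optional[float]:
    """Return the AF for one variant string, or None if it is not in the table."""
    row = connection.execute(
        "SELECT AF FROM variants WHERE chrom = ? AND pos = ? AND ref = ? AND alt = ?",
        parse_variant(variant),
    ).fetchone()
    return row[0] if row else None


print(lookup_af(conn, "chr21:9825790:C>T"))
print(lookup_af(conn, "21:1:A>G"))
```

Keying the table on (chrom, pos, ref, alt) lets SQLite resolve each lookup through the primary-key index rather than a full scan, which is one plausible reason a store of this kind can answer large batches of queries quickly.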