Update README.md
KalinNonchev authored May 9, 2021
1 parent 7657013 commit b0b800f
Showing 1 changed file with 27 additions and 10 deletions.
37 changes: 27 additions & 10 deletions README.md
@@ -1,11 +1,11 @@
# gnomAD_MAF
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators, with the goal of aggregating and harmonizing both exome and genome sequencing data from a wide variety of large-scale sequencing projects, and making summary data available for the wider scientific community.
[The Genome Aggregation Database (gnomAD)](https://gnomad.broadinstitute.org) is a resource developed by an international coalition of investigators, with the goal of aggregating and harmonizing both exome and genome sequencing data from a wide variety of large-scale sequencing projects, and making summary data available for the wider scientific community.

This package scales the huge gnomAD files (on average ~120G/chrom) to a SQLite database with a size of 56G for WGS v3.1.1.
This package scales down the huge gnomAD files (on average ~120 GB per chromosome) to a SQLite database of about 56 GB for WGS v3.1.1 (about 760,000,000 variants), and lets scientists look up minor allele frequencies of variants quickly: a query containing 300,000 variants takes ~40 s.

It extracts the ["AF", "AF_afr", "AF_eas", "AF_fin", "AF_nfe", "AF_asj", "AF_oth", "AF_popmax"] columns from a gnomAD VCF.
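
As a rough illustration of what extracting these columns involves, the sketch below pulls the listed fields out of the INFO column of a single VCF record using only the standard library. The helper name `extract_af_fields` and the sample record are hypothetical and not part of the package; it also assumes one ALT allele per line. The package's own preprocessing may work differently.

```python
# Columns named in the README; everything else here is an illustrative assumption.
AF_COLUMNS = ["AF", "AF_afr", "AF_eas", "AF_fin", "AF_nfe", "AF_asj", "AF_oth", "AF_popmax"]


def extract_af_fields(vcf_line, columns=AF_COLUMNS):
    """Return chrom, pos, ref, alt and the requested INFO values for one VCF record."""
    fields = vcf_line.rstrip("\n").split("\t")
    # The first eight VCF columns are CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO.
    chrom, pos, _id, ref, alt, _qual, _filter, info = fields[:8]

    # INFO is a ';'-separated list of key=value pairs (flags have no '=').
    info_map = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        if sep:
            info_map[key] = value

    record = {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt}
    for column in columns:
        value = info_map.get(column, ".")
        # "." (or an absent key) is treated as a missing value.
        record[column] = None if value == "." else float(value)
    return record


# Made-up record for demonstration only, not real gnomAD data.
example_line = "chr21\t9825790\t.\tC\tT\t.\tPASS\tAC=2;AF=0.25;AF_afr=0.5;AF_popmax=0.5"
example_record = extract_af_fields(example_line)
```

Fields that are absent from the record, such as `AF_eas` in the made-up line, come back as `None`, which mirrors the missing-value handling the usage examples apply before inserting variants.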

It works for all currently available gnomAD releases.
###### The package works for all currently available gnomAD releases (as of 2021).

## 1. Data preprocessing and SQL database creation

@@ -23,7 +23,7 @@ tables_location: "test_out" # where to store the preprocessed intermediate files
script_locations: "test_out" # where to store the scripts and check the progress of your jobs; you can leave this as is
```

Ones, this is done, run
Once this is done, run
```bash
conda env create -f environment.yaml
conda activate gnomad_db
@@ -46,24 +46,30 @@ pip install gnomad_db
```

You can use the package as follows:

1. Import modules
```python
import numpy as np  # needed for the missing-value replacement in step 3
import pandas as pd
from gnomad_db.database import gnomAD_DB
```

2. Initialize database connection
```python
# pass the database directory
database_location = "test_dir"
db = gnomAD_DB(database_location)
```

3. Insert some test variants to run the examples below
```python
# get some variants
var_df = pd.read_csv("data/test_vcf_gnomad_chr21_10000.tsv.gz", sep="\t", names=db.columns, index_col=False)
# IMPORTANT: the database internally removes the chr prefix (chr1 -> 1)
# preprocess missing values: replace the "." placeholder with NaN
var_df = var_df.replace(".", np.nan)

# insert these variants
db.insert_variants(var_df)
```

4. Query variant minor allele frequency
```python
# query some MAF scores
dummy_var_df = pd.DataFrame({
@@ -72,13 +78,24 @@ dummy_var_df = pd.DataFrame({
"ref": ["T", "C"],
"alt": ["G", "T"]})

# query from dataframe
db.get_maf_from_df(dummy_var_df, "AF").head()
# query the AF column for the variants in the dataframe
db.get_maf_from_df(dummy_var_df, "AF")

# query the AF and AF_popmax columns for the variants in the dataframe
db.get_maf_from_df(dummy_var_df, "AF, AF_popmax")

# query all columns for the variants in the dataframe
db.get_maf_from_df(dummy_var_df, "*")

# query from string
db.get_maf_from_str("21:9825790:C>T", "AF")
```


5. You can also query minor allele frequencies over an interval
```python
db.get_mafs_for_interval(chrom=21, interval_start=9825780, interval_end=9825799, query="AF")
```
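
To give a feel for why lookups against a SQLite table are fast, here is a minimal, self-contained sketch of string and interval queries using Python's built-in `sqlite3` module. The table layout, the helper names (`parse_variant`, `maf_for_variant`, `mafs_for_interval`), the inclusive interval bounds, and the inserted values are all assumptions made for illustration; they are not the package's actual schema or implementation.

```python
import sqlite3


def parse_variant(variant):
    """Split 'chrom:pos:ref>alt' into its parts, dropping a leading 'chr'."""
    chrom, pos, alleles = variant.split(":")
    ref, alt = alleles.split(">")
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    return chrom, int(pos), ref, alt


# Hypothetical schema: one row per variant, keyed by (chrom, pos, ref, alt).
# The primary key gives SQLite an index for both exact and positional lookups.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variants (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, AF REAL, "
    "PRIMARY KEY (chrom, pos, ref, alt))"
)
# Made-up frequencies for demonstration only.
conn.executemany(
    "INSERT INTO variants VALUES (?, ?, ?, ?, ?)",
    [
        ("21", 9825790, "C", "T", 0.25),
        ("21", 9825795, "G", "A", 0.01),
        ("21", 9900000, "A", "G", 0.5),
    ],
)


def maf_for_variant(conn, variant):
    """Look up AF for a single 'chrom:pos:ref>alt' string; None if not found."""
    row = conn.execute(
        "SELECT AF FROM variants WHERE chrom = ? AND pos = ? AND ref = ? AND alt = ?",
        parse_variant(variant),
    ).fetchone()
    return None if row is None else row[0]


def mafs_for_interval(conn, chrom, start, end):
    """Return (pos, ref, alt, AF) for all variants with start <= pos <= end."""
    return conn.execute(
        "SELECT pos, ref, alt, AF FROM variants "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? ORDER BY pos",
        (str(chrom), start, end),
    ).fetchall()


single = maf_for_variant(conn, "chr21:9825790:C>T")
window = mafs_for_interval(conn, 21, 9825780, 9825799)
```

Parameterized queries (`?` placeholders) keep the lookup safe against malformed input, and stripping the `chr` prefix before querying matches the normalization the README says the database applies internally.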

For more information on how to use the package, see the GettingStartedwithGnomAD_DB.ipynb notebook.

#### NB: The package is under development; suggestions for use cases, extensions, and other feedback are welcome.
