Skip to content

Latest commit

 

History

History
620 lines (420 loc) · 38.5 KB

README.md

File metadata and controls

620 lines (420 loc) · 38.5 KB

Introduction

This repository contains the course materials for hands-on exercise of the "Variants Annotation and Phenotype Analysis" session in the Quantitative Genomics workshop at Columbia University on June 24-25, 2024.

Preparation of computing environment

Both the lectures and hands-on exercises will be taught via Zoom video conference online. To ensure the cross-platform compatibility, we will only use Rstudio Cloud to implement tools that are developed in Perl and Python.

Students need to go to Posit Cloud to create your own account before the lectures:

  1. click 'Sign up' on the top-right corner image

  2. select "Free" and click "Sign up" at the right panel. image

  3. After you input your information, click "Sign up".

If you logout after signing up, you can go to RStudio Cloud and login with your username and password. After you are added to "2024 Quantitative Genomics" project, you can click "Annotation2024" to start the exercise. The direct link is provided in Slack channel.

image

When the student opens an assignment project, RStudio Cloud will automatically make a copy for them. In other words, each student will work on your own copy of the virtual machine without influencing each other's files.

By default, you will be in "Console" as shown below where you can run R code.

image

You can also click "Terminal" so that you can run linux commands.

image

During the exercise, you can feel free to swith to Terminal when you need to run linux commands and to Console when you need to run R code.

Basic Linux commands

After login to Rstudio Cloud, you already have terminal access by clicking 'terminal' button and can run all basic Linux commands. If you are not familiar with basic Linux commands, you can follow one simple tutorial to learn several commands: https://maker.pro/linux/tutorial/basic-linux-commands-for-beginners. Some of the commands that we will use in the exercise include ls, pwd, cd, mv and wget.

Functional annotation of genetic variants

1. Install ANNOVAR

Typically you will go to the ANNOVAR website, fill in a registration form, and download the package there. For this exercise, we already prepared a folder in the cloud server that contains a "compact" version of ANNOVAR and necessary library files, to make it easier for users. The folder is called genomics_exercise and you can simply use cd genomics_exercise to go to this folder). To confirm which directory you are in, you can type pwd. You will see the example below.

/cloud/project$ cd genomics_exercise/
/cloud/project/genomics_exercise$ pwd
/cloud/project/genomics_exercise

Please note that this is a sub-sampled version of ANNOVAR for the purpose of th exercise today (to reduce file size significantly to <200Mb), and it should not be used in any other real data analysis. You can view all the files in exercise1 directory:

image

2. Run ANNOVAR on a small VCF file

Type cd exercise1 to enter the exercise1 directory. The sub-folder humandb folder already contains several annotation databases for human genome that we will use in our exercise. (Note that users can find more annotation databases here.

perl table_annovar.pl example/ex2.vcf humandb/ -buildver hg19 -out myanno -remove -protocol refGeneWithVer,cytoBand,gnomad211_exome,dbnsfp47a -operation g,r,f,f -nastring . -vcfinput -polish

After that, you will find the result files myanno.hg19_multianno.txt and myanno.hg19_multianno.vcf.

The -protocol and the -operation arguments are the most important one to understand here. Protocol lists a set of tasks that you want to use for annotation, composed of a list of comma-separated keywords. Operation specifies the type of annotations that each step in the protocol uses, composed of a list of comma-separated letters such as g, r and f, corresponding to gene-based, region-based and filter-based annotations, respectively.

In the command above, we specified a protocol with three annotation tasks, including RefGene annotation (a gene-based annotation), cytogenetic band annotation (a region-based annotation), allele frequency in gnoMAD version 2.1.1 (a filter-based annotation), and functional prediction from dbNSFP database version 4.7 as filter-based annotation. These three types of annotation are indicated as g, r and f in the -operation argument, respectively. We also specify that dots ('.') be used to indicate lack of information, so variants that are not observed in the database will have '.' as the annotation. The -vcfinput argument specifies that the input file is in VCF format. We also specify that the output file names should have "myanno" as the prefix, to make it easier to check output files.

If you have Excel installed, you can open the hg19_multianno.txt file by Excel and examine the various tab-delimited fields. Since we are using Rstudio now, we could also easily import the tab-delimited file into Rstudio to examine the various rows and columns. We will do this later to filter and prioritize disease-relevant variants.

3. Run ANNOVAR on a human exome

Next, we want to download a VCF file and then run ANNOVAR on this file.

In the exercise1 folder: (you don't actually need to run this step during the course)

wget http://molecularcasestudies.cshlp.org/content/suppl/2016/10/11/mcs.a001131.DC1/Supp_File_2_KBG_family_Utah_VCF_files.zip -O Supp_File_2_KBG_family_Utah_VCF_files.zip

To give some background information, this is a zip file as supplementary material of a published paper on exome sequencing of a family with undiagnosed genetic diseases. Through analysis of the exome data, the proband was confirmed to have KBG syndrome, a disease caused by loss of function mutations in ANKRD11 gene. There are several VCF files contained in the zip file, including those for parents, silings and the proband. We will only analyze proband in this exercise, but if you are interested, you may want to check whether this is a de novo mutation by analyzing parental genomes.

Next unzip the ZIP file (unzip):

unzip Supp_File_2_KBG_family_Utah_VCF_files.zip
mv './File 2_KBG family Utah_VCF files/' VCF_files

Note that in the commands above, we rename the directory to "VCF_files" just to make it easier to read and operate.

Run ANNOVAR on the VCF file:

perl table_annovar.pl VCF_files/proband.vcf humandb/ -buildver hg19 -out proband.annovar -remove -protocol refGeneWithVer,gnomad211_exome,dbnsfp47a -operation g,f,f -nastring . -vcfinput

The proband.annovar.hg19_multianno.txt file contains annotations for this exome.

Now compare this command from the previous table_annovar.pl command, we can see that this type we only request a few annotation tasks (refGeneWithVer and gnomad211_exome and dbnsfp47a), which are gene-based, filter-based and filter-based annotations, respectively.

4. Results visualization

We used "terminal" for the commands above to generate results. To do further analysis in R, we can now switch to the "console" in Rstudio.

In the exercise1 folder, run pwd to obtain the working directory and copy this path.

Paste this path to R console or R studio interface to setup the working directory using function setwd(). For example:

setwd("/cloud/project/genomics_exercise/exercise1")

Check variant distribution across chromesomes:

#load data
res <- read.table("proband.annovar.hg19_multianno.txt", fill=T, header=T, sep="\t", na.strings = ".", quote="")

#visualize variant frequency
par(mar=c(5.1, 4.1, 4.1, 2.1),mfrow=c(1,1))
table <- table(res$Chr)
chrlist <- paste0("chr",1:22)
chrlist <- c(chrlist, "chrX")
barplot(table[chrlist], ylab="Number of Variant", las=2)

At this point, you should see a barplot similar to the one below:

image

Check allele frequency distribution between non-synonymous, synonymous and intronic variants:

#visualize allele frequency
par(mar=c(2, 4, 1, 1),mfrow=c(3,1))
af_syn<-res[res$ExonicFunc.refGeneWithVer=="synonymous SNV",]$AF
hist(as.numeric(af_syn),  ylab="Frequency", xlab="Allele frequency", main="Synonymous SNV", breaks=seq(0,1,.001))

af_nonsyn<-res[res$ExonicFunc.refGeneWithVer=="nonsynonymous SNV",]$AF
hist(as.numeric(af_nonsyn), ylab="Frequency", xlab="Allele frequency", main="nonSynonymous SNV",breaks=seq(0,1,.001))

af_intron<-res[res$Func.refGeneWithVer =="intronic",]$AF
hist(as.numeric(af_intron), ylab="Frequency", xlab="Allele frequency", main="Intronic", breaks=seq(0,1,.001))

You should see a figure similar to the one below:

image

Check allele frequency distribution across race:

#visualize allele frequency across race
par(mar = c(12, 5, 2, 2),mfrow=c(1,1)) 
res_sub <- res[res$Gene.refGeneWithVer=="NRXN2",]
variantname <- res_sub$Otherinfo6[1]
res_sub <- as.matrix(res_sub[1,16:23])
colnames(res_sub) <- c("African/African-American", "South Asian", 
                       "Latino/Admixed American", "East Asian", "Non-Finnish European", "Finnish",
                       "Ashkenazi Jewish", "Other")
barplot(res_sub, ylab="Allele frequency", main=variantname, las=2)

You should see a figure similar to the one below:

image

Next, we examine the allele frequency distributions stratified by SIFT scores, which predict the impact of amino acid substitutions on protein function based on sequence homology and the physical properties of amino acids. We expect variants associated with more deleterious effects (SIFT score < 0.05) to be rarer.

#visualize allele frequency stratified by SIFT scores
res_tolerate = res[res$SIFT_score > 0.05 & !is.na(res$SIFT_score),];dim(res_tolerate)
res_deleterious = res[res$SIFT_score<0.05 & !is.na(res$SIFT_score),];dim(res_deleterious)
par(mar=c(2.5, 4.5, 1.5, 1.5)+2,mfrow=c(1,1))
hist_deleterious <- hist(as.numeric(res_deleterious$AF), breaks=seq(0, 1, 0.001), plot = FALSE)
hist_tolerate <- hist(as.numeric(res_tolerate$AF), breaks=seq(0, 1, 0.001), plot = FALSE)
plot(hist_tolerate, col=rgb(0, 0, 1, 0.5), ylim=c(0, max(c(hist_deleterious$counts, hist_tolerate$counts))),
     xlab="Allele Frequency", ylab="Frequency", main="Allele Frequency Distribution", xlim=c(0, 1),
     border=NA)  # Set border to NA to remove outline
plot(hist_deleterious, col=rgb(1, 0, 0, 0.5), add=T, border=NA)
legend(x=0.6, y=115, legend=c("SIFT < 0.05", "SIFT > 0.05"), col=c("red", "blue"), 
       fill=c(rgb(1, 0, 0, 0.5), rgb(0, 0, 1, 0.5)),border=NA)

You should see a figure similar to the one below: image

We next use ggplot2 package to compare allele frequency distributions stratified by two other predictive scores: MetaRNN and AlphaMissense. We can similarly find lower allele frequencies for predicted pathogenic variants (MetaRNN: T(olerated) and D(amaging), AlphaMissense: likely (B)enign, (A)mbiguous, or likely (P)athogenic ).

library(ggplot2)
library(dplyr)
library(ggpubr)
a = ggplot(res%>%filter(is.na(MetaRNN_pred)==F),aes(x = MetaRNN_pred, y = AF))+
  geom_boxplot(width = 0.3)+theme_classic()+ggtitle('MetaRNN_pred')
b = ggplot(res%>%filter(is.na(AlphaMissense_pred)==F),aes(x = AlphaMissense_pred, y = AF))+
  geom_boxplot(width = 0.3)+theme_classic()+ggtitle('AlphaMissense_pred')
ggarrange(a,b,ncol=2)
image

Feel free to use the table browser to examine the various rows and columns in this table and perform additional summary statistics. Later on, after the exercise 2 (phenotype analysis) below, we will go back to this table to show how combined analysis of genotype data and phenotype data can facilitate genetic diagnosis of rare diseases.

Phenotype-driven prioritization of human disease genes (Phen2Gene, ClinPhen, AMELIE, PhenoSV, etc)

Phen2Gene is a phenotype-based gene prioritization tool from HPO IDs or clinical notes on patients. In the next exercise, we will first use Phen2Gene to prioritize genes based on clinical phenotypes of patients with Mendelian diseases.

We will do the exercise in a directory called exercise2. In the terminal, assuming that you are currently in the exercise1 directory (you can use pwd to check this), you can use command cd ../exercise2 to go into the exercise2 directory.

There are a few ways to run Phen2Gene: run it locally (need a few GB of space), use the Phen2Gene API and use Phen2Gene website.

The benefit of running Phen2Gene is if you do not have any idea of a candidate gene for your disease, you can use it in one of three scenarios:

  1. Ideally, you have a list of physician-curated HPO terms describing a patient phenotype and a list of potential candidate disease variants that overlap gene space and you want to narrow down the list of variants by prioritizing candidate disease genes, often in tandem with variant prioritization software, which cannot as of yet score STR expansions or SVs unlike Phen2Gene which is variant agnostic.
  2. You do not have variants, but you have HPO terms and would like to get some candidate genes for your disease that you may want to target sequence, as it is much cheaper than whole-genome or whole-exome sequencing.
  3. If you have clinical notes, you can use tools like EHR-Phenolyzer or Doc2HPO for processing clinical notes into HPO terms using natural language processing (NLP) techniques, then apply scenario 1 or 2 as relevant.

Other tools listed below (ClinPhen, AMELIE, etc) require a gene list, and Phen2Gene does not require any variant data or prior gene lists to provide high quality candidate genes. One would most likely use Phen2Gene for obtaining the genetic etiology of an undiagnosed rare disease with no obvious genetic cause.

1. Using Phen2Gene API

  1. Go to Terminal, make sure you are in the exercise2 directory first, and run curl -i -H "Accept: application/json" -H "Content-Type: application/json" "https://phen2gene.wglab.org/api?HPO_list=HP:0000455;HP:0000574;HP:0030084;HP:0012471;HP:0000239;HP:0001572;HP:0000960;HP:0001250;HP:0000322;HP:0001831" | tail -n 1 > output.txt where you generate JSON output in output.txt However, since the output.txt file is in JSON format, it is not very intuitive to view the content of the file. Instead, we will use the table browser in Rstudio to view a formatted version of the JSON file.
  2. Go To Console, remember that we are probably in the exercise1 directory, so we should first set exercise2 as the working directory.
setwd("../exercise2")
  1. Then we can run
# install JSON in R
install.packages("rjson")
# Load JSON in R
library("rjson")
# Read JSON results
result <- fromJSON(file = "output.txt")
# Convert them to array and name columns.
marray <- t(array( unlist(result$results), dim=c(5, length(result$results)) ) );
colnames(marray) <- names(result$results[[1]]);
# View the results in 2-D array. The second column is the rank of genes.
View (marray);
# Get top 100 genes based on Phen2Gene score
top100<-matrix(array(unlist(result$results)), 5)[1,][1:100]
write.table(top100, file="phen2gene_top100_genes", quote=F, col.names = F, row.names = F)

image

You can see that the top ranked genes are VPS13B, ARID1B, etc.

2. Running Phen2Gene locally in Terminal

Under the exercise2 directory, we can run python /cloud/project/usr/Phen2Gene/phen2gene.py -f hpo_list.txt where we generate Phen2Gene output file in out/output_file.associated_gene_list.

The hpo_list.txt file contains a list of HPO terms for a patient (one per line). If you want to see what is in the file, you can use cat hpo_list.txt to see the list of terms. The output file is written to out/output_file.associated_gene_list by default.

Use command more out/output_file.associated_gene_list to check the predicted genes.

image

You can see that the top ranked genes are VPS13B, ARID1B, etc. Type q to quit the mode.

3. Using the Phen2Gene Website to assess the case with KBG syndrome

Go to https://phen2gene.wglab.org. Click on the tab Patient notes:

image1

and paste this paragraph of clinical notes in the text box:

The proband had an abnormally large fontanelle, which resolved without treatment. The proband does not appear to have a sacral dimple. Other than the presence of flat arches, there are no obvious signs of foot abnormalities. The proband does not look like his other siblings, although there was a resemblance between him and his sister when they were the same age. Features of proband’s face can be seen, including bushy eyebrows, broad nasal tip, short philtrum, full lips and cupid bow of upper lip. Video of proband’s behavior. Proband is non-verbal, and hyperactive. He repetitively spins his toy. While playing, he gets up from his chair, walks a few steps, stomps his feet, and sits back down. Additional interview in August 2015, after the parents received the proband’s diagnosis of KBG syndrome. The proband develops new mannerisms every four to six months, the most recent being short, hard breaths through the nose and head turning. The proband has had a substantial decrease in the number of seizures after starting an Epidiolex (cannabidiol) treatment (70-80% decrease as described by the parents). The frequency of seizures increased after the proband fell and fractured his jaw.  The mother describes the proband’s macrodontia. Although the mother and several siblings have large teeth, the macrodontia in the proband does not appear in any other member of the family.  The proband’s features are compared to other characteristics usually found in other KBG patients. Unlike most KBG patients, the proband has full lips. Like most KBG patients, the proband has curved pinkies (diagnosed as clinodactyly), which are often found in KBG patients.  Although the proband has relatively short toes, this trait may have been inherited from the father. The proband also has curved toenails, which commonly appear in autistic children.

image2

Then click Submit.

image3

You should see that ANKRD11 is in the top 3. The reason it is not 2 as in the previous example is that we manually curated HPO terms from the patient notes for each patient in our Phen2Gene paper and this is raw notes with no curation done and all HPO terms are being automatically extracted by Doc2HPO. Despite this, the discrepancy in our results is minimal.

image4

Alternatively, you can also submit the HPO terms:

HP:0000707
HP:0007598
HP:0001156
HP:0012446
HP:0004209
HP:0000405
HP:0001627
HP:0002750
HP:0003235
HP:0000239
HP:0001572
HP:0002123
HP:0011220
HP:0004482
HP:0002069
HP:0012471
HP:0001566
HP:0001250

manually using the tab HPO IDs (they are already the default terms in the website so you can just click Submit on that tab).

image5

And you should see that ANKRD11 is still number 2.

image6

4. Run ClinPhen

ClinPhen is another tool to analyze clinical notes. To run this tool,

  1. Go to Terminal and run the commands below to install it.

export PATH="$HOME/.local/bin":$PATH

pip install clinphen
  1. Download package nltk in Python. In terminal, type python, then type
import nltk
nltk.download('omw-1.4')

After completing installation, type exit() to quit Python.

  1. Run ClinPhen
clinphen m_notes.txt

The expected results are below:

/cloud/project/genomics_exercise/exercise2$ clinphen m_notes.txt
HPO ID  Phenotype name  No. occurrences Earliness (lower = earlier)     Example sentence
HP:0012471      Thick vermilion border  2       12      unlike most kbg patients the proband has full lips 
HP:0000574      Thick eyebrow   1       9       features of proband s face can be seen including bushy eyebrows broad nasal tip short philtrum full lips and cupid bow of upper lip 
HP:0000455      Broad nasal tip 1       10      features of proband s face can be seen including bushy eyebrows broad nasal tip short philtrum full lips and cupid bow of upper lip 
HP:0000322      Short philtrum  1       11      features of proband s face can be seen including bushy eyebrows broad nasal tip short philtrum full lips and cupid bow of upper lip 
HP:0002263      Exaggerated cupid's bow 1       13      features of proband s face can be seen including bushy eyebrows broad nasal tip short philtrum full lips and cupid bow of upper lip 
HP:0001250      Seizures        1       32      the frequency of seizures increased after the proband fell and fractured his jaw 
HP:0030084      Clinodactyly    1       42      like most kbg patients the proband has curved pinkies diagnosed as clinodactyly which are often found in kbg patients 

You will get the results from ClinPhen. You can also type clinphen --help if you are interested in more options.

To find candidate genes for this set of HPO terms, you can input HP:0012471;HP:0000574;HP:0000455;HP:0000322;HP:0002263;HP:0001250;HP:0030084 into Phen2Gene web server.

5. Running AMELIE

AMELIE is yet another tool for analyzing clinical notes, however it requires a candidate gene list. We put the real causal gene in a algorithm-allowed maximum 1000 gene list of random exome capture genes.

First copy this list of HPO terms to your clipboard. Then go to the AMELIE website and click "Submit Case."

image

Now, paste the HPO term list into "Case phenotypes."

image

Then, copy this list of genes to your clipboard.

Next, paste the gene list into "Case genotype" under the tab "Gene List" then click Submit.

image

You'll get a screen that looks like this:

image

And a wheel that tells you to wait a while.

After the wheel is done spinning your results should look like the following:

image

The number 1 gene, TAF1 is the correct causal gene. But Phen2Gene gets the same result.

6. Running PhenCards

Go to https://phencards.org/. Click on the tab Patient notes:

image

and paste this paragraph of clinical notes in the text box:

The proband had an abnormally large fontanelle, which resolved without treatment. The proband does not appear to have a sacral dimple. Other than the presence of flat arches, there are no obvious signs of foot abnormalities. The proband does not look like his other siblings, although there was a resemblance between him and his sister when they were the same age. Features of proband’s face can be seen, including bushy eyebrows, broad nasal tip, short philtrum, full lips and cupid bow of upper lip. Video of proband’s behavior. Proband is non-verbal, and hyperactive. He repetitively spins his toy. While playing, he gets up from his chair, walks a few steps, stomps his feet, and sits back down. Additional interview in August 2015, after the parents received the proband’s diagnosis of KBG syndrome. The proband develops new mannerisms every four to six months, the most recent being short, hard breaths through the nose and head turning. The proband has had a substantial decrease in the number of seizures after starting an Epidiolex (cannabidiol) treatment (70-80% decrease as described by the parents). The frequency of seizures increased after the proband fell and fractured his jaw.  The mother describes the proband’s macrodontia. Although the mother and several siblings have large teeth, the macrodontia in the proband does not appear in any other member of the family.  The proband’s features are compared to other characteristics usually found in other KBG patients. Unlike most KBG patients, the proband has full lips. Like most KBG patients, the proband has curved pinkies (diagnosed as clinodactyly), which are often found in KBG patients.  Although the proband has relatively short toes, this trait may have been inherited from the father. The proband also has curved toenails, which commonly appear in autistic children.

image

Then click Search.

image

This is raw notes with no curation done and all HPO terms are being automatically extracted by Doc2HPO. You should see that KBG syndrome is among the top disease list.

image

Alternatively, you can also submit the following list of HPO terms manually using the PhenCards web server to find the information of HPO terms. (Do not enter these terms into PhenCards which only find cross-reference for each one of these terms.)

HP:0000455; HP:0000574; HP:0030084; HP:0012471; HP:0000239; HP:0001572; HP:0000960; HP:0001250; HP:0000322; HP:0001831

The expected results are shown below in Phen2Gene server:

image

7. Filtering ANNOVAR annotation results based on Phen2Gene output

Now that we performed variant annotation in exercise1, and performed phenotype-based gene prioritization in exercise2, we want to see how to infer and prioritize candidate genes for the disease.

Go to Console, we can run

# Load top 100 genes generated by Phen2Gene
top100 <- read.table("out/output_file.associated_gene_list", header=T)
top100 <- top100$Gene[1:100]

# Load results of ANNOVAR
res <- read.table("../exercise1/proband.annovar.hg19_multianno.txt", fill=T, header=T, sep="\t", na.strings = ".", quote="")

# Filtering ANNOVAR annotations based on Phen2Gene output
res$Gene.refGeneWithVer
res <- res[res$Gene.refGeneWithVer %in% top100,]
res <- res[res$Func.refGeneWithVer=="exonic",]
res <- res[res$ExonicFunc.refGeneWithVer!="synonymous SNV",]
res <- res[res$AF_popmax<.005 | is.na(res$AF_popmax),]
View(res)

The analytical logic in the above command is that: first we take only exonic variants in the top 100 genes predicted by phenotype-driven gene prioritization software. Next, we remove synonymous SNVs from the list. Next, we keep only variants that are not observed in the gnomAD database, or observed but with allele frequency lower than 0.005.

  1. You can see that gene ANKRD11 is in the following list

image

8. Running Phenomizer

Phenomizer is another a web-based application for clinical diagnostics in human genetics using semantic similarity searches in ontologies.

HP:0000455
HP:0000574
HP:0030084
HP:0012471
HP:0000239
HP:0001572
HP:0000960
HP:0001250
HP:0000322
HP:0001831

First copy one HPO term above to the search bar Phenomizer and click "Search".

image

Then right click the searched HPO term and select "Add to Patient's Features". You will see this HPO term added to the right panel of Phenomizer. Repeat this step for all HPO terms listed above.

image

Select "Yes" to make sure 'symmetric' mode algorithm checked on.

image

After all HPO terms added to the right panel, click "Get diagnosis" at the bottom right corner of Phenomizer.

image

You should see that KBG syndrome is among the top disease list

image

9. Running OARD

OARD (An acronym form of "oh alright") is an Open real-world based Annotation for Rare Diseases and its associated phenotypes.

This is a React web app to serve the web app of OARD. The backend is provided by OARD API. Currently it is hosted on the NCATS AWS server (https://rare.cohd.io/).

An overview of OARD project

open annotation for rare diseases (OARD), a real-world-data-derived resource with annotation for rare-disease-related phenotypes. This resource is derived from the EHRs of two academic health institutions containing more than 10 million individuals spanning wide age ranges and different disease subgroups. By leveraging ontology mapping and advanced natural-language-processing (NLP) methods, OARD automatically and efficiently extracts concepts for both rare diseases and their phenotypic traits from billing codes and lab tests as well as over 100 million clinical narratives.

1. Using OARD web app

Go to https://rare.cohd.io/.

image

Select Method as mostFrequency and Domain as Disease to view most frequently occurred disease concept in a dataset:

image

Then click Submit:

image

Type HP:0012461 into Query Concept List 1 and select Method as singleConceptFreq and Dataset as 1-all to view single concept frequency in a dataset:

image

Then click Submit:

image

Type HP:0012461 into Query Concept List 1 and HP:0500111 into Query Concept List 2 and select Method as pairConceptFreq and Dataset as 1-all to view pair concept frequency in a dataset:

image

2. Using OARD API

  1. Go to Terminal, make sure you are in the exercise2 directory first, and run

curl "https://rare.cohd.io/api/frequencies/mostFrequency?dataset_id=2&domain_id=diseases" > output_oard1.txt

curl "https://rare.cohd.io/api/frequencies/singleConceptFreq?dataset_id=1&concept_id=90012461" > output_oard2.txt

curl "https://rare.cohd.io/api/frequencies/pairedConceptFreq?dataset_id=1&concept_id_1=90012461&concept_id_2=90500111" > output_oard3.txt

where you generate JSON outputs. However, since the output.txt file is in JSON format, it is not very intuitive to view the content of the file. Instead, we will use the table browser in Rstudio to view a formatted version of the JSON file.

  1. Go To Console, remember that we are probably in the exercise1 directory, so we should first set exercise2 as the working directory.
setwd("../exercise2")
  1. Then we can run
library("rjson")
# Read JSON results
result1 <- fromJSON(file = "output_oard1.txt")
# Convert them to array and name columns.
marray1 <- t(array( unlist(result1$results), dim=c(7, length(result1$results)) ) );
colnames(marray1) <- names(result1$results[[1]]);
# View the results in 2-D array.
View (marray1);

image

result2 <- fromJSON(file = "output_oard2.txt")
# Convert them to array and name columns.
marray2 <- t(array( unlist(result2$results), dim=c(7, length(result2$results)) ) );
colnames(marray2) <- names(result2$results[[1]]);
# View the results in 2-D array.
View (marray2);

image

result3 <- fromJSON(file = "output_oard3.txt")
# Convert them to array and name columns.
marray3 <- t(array( unlist(result3$results), dim=c(14, length(result3$results)) ) );
colnames(marray3) <- names(result3$results[[1]]);
# View the results in 2-D array.
View (marray3);

image

11. Run PhenoSV

PhenoSV is an interpretable phenotype-aware model to prioritize genes affected by structural variants (SVs). PhenoSV can predict pathogenicity of all types of SVs that disrupt either coding or noncoding genome regions, including deletions, duplications, insertions, inversions, and translocations. When phenotype information is available, PhenoSV further utilizes gene-phenotype associations (Phen2Gene) to prioritize disease-related SVs.

For small SVs, we can use the web server (PhenoSV.) to score and interpret SVs. PhenoSV also offers a command-line tool. In the next exercise, we will use PhenoSV to score a single SV with patient's clinical phenotypes.

We need to install database files for PhenoSV to operate. Note that the databases are large and may require more than 15 minutes to complete the installation.

#download PhenoSV database
cd /var/tmp
wget https://www.openbioinformatics.org/PhenoSV/PhenosvlightFile.tar
tar -xvf "PhenosvlightFile.tar"
rm "PhenosvlightFile.tar"
wget https://github.com/WGLab/Phen2Gene/releases/download/1.1.0/H2GKBs.zip
unzip -q "H2GKBs.zip"
rm "H2GKBs.zip"
cd "/cloud/project/genomics_exercise/exercise3/PhenoSV"
bash update_config.sh /var/tmp

After installing the database, We can use PhenoSV. In terminal, use command:

python3 phenosv/model/phenosv.py --c chr6 --s 156994830 --e 157006982 --svtype 'deletion' --noncoding 'tad' --HPO 'HP:0000707,HP:0007598' --model 'PhenoSV-light'

  Elements  Pathogenicity           Type  Phen2Gene   PhenoSV
0       SV       0.607864  Non-coding SV   0.999126  0.607332
1   ARID1B       0.550785       Intronic   0.999126  0.550304
2     NOX3       0.155716     Regulatory   0.837460  0.130406
3    TFB1M       0.249238     Regulatory   0.544762  0.135775

12. Summary of the phenotype analysis exercises

In summary, a number of computational tools such as Phen2Gene, AMELIE and GADO can perform phenotype-driven gene prioritization. Phen2Gene provides webserver or API, or you can install and run locally (which is important to deploy it in batch processing mode), and it does not require a list of random genes to run either.

CITATIONS:

  • ANNOVAR: Wang K, Li M, Hakonarson H. ANNOVAR: Functional annotation of genetic variants from next-generation sequencing data. Nucleic Acids Research, 38:e164, 2010
  • Phen2Gene: Zhao, M. et al. Phen2Gene: rapid phenotype-driven gene prioritization for rare diseases. NAR Genom Bioinform, 2:lqaa032 (2020).
  • AMELIE: Birgmeier, J. et al. AMELIE speeds Mendelian diagnosis by matching patient phenotype and genotype to primary literature. Sci. Transl. Med. 12:eaau9113, (2020).
  • GADO: Deelen, P. et al. Improving the diagnostic yield of exome- sequencing by predicting gene-phenotype associations using large-scale gene expression analysis. Nat. Commun. 10:2837 (2019).
  • ClinPhen: Deisseroth, C. A. et al. ClinPhen extracts and prioritizes patient phenotypes directly from medical records to expedite genetic disease diagnosis. Genet. Med. 21:1585–1593 (2019).
  • PhenCards: Havrilla, J.M. et al. PhenCards: a data resource linking human phenotype information to biomedical knowledge. Genom. Med. 13:91 (2021)
  • OARD: Liu, C. et al. OARD: Open annotations for rare diseases and their phenotypes based on real-world data. Am. J. Hum. Genet. 109(9): 1591–1604, (2022)
  • PhenoSV: Xu Z, Li Q, Marchionni L and Wang K. PhenoSV: interpretable phenotype-aware model for the prioritization of genes affected by structural variants. Nat. Commun. 14:7805 (2023)