added option --print-protein and --print-protein-pretty
Wanding Zhou committed Apr 27, 2016
1 parent ed5b837 commit 8834d01
Showing 13 changed files with 330 additions and 115 deletions.
41 changes: 2 additions & 39 deletions docs/source/annotation_from_genomic_level.rst
@@ -359,43 +359,6 @@ A block-substitution that is in-frame,
CSQN=Missense;codon_cDNA=508-509-510;source=CCDS


Inspect variant protein sequence
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The `--print-protein` and `--print-protein-pretty` options display the full variant protein sequence in the `variant_protein_seq` field of the info column when the genomic variant hits a protein-coding transcript.

.. code:: bash
$ transvar ganno -i 'chr1:g.115256530G>A' --ensembl --print-protein
::


`--print-protein-pretty` output is more human-readable and highlights the mutation in brackets.

.. code:: bash
$ transvar ganno --ccds -i 'chr3:g.178936091G>A' --print-protein-pretty
::

The alphabet transformation option `--aa3` applies here as well.

.. code:: bash
$ transvar ganno -i 'chr1:g.115256530G>A' --ensembl --print-protein-pretty --aa3
::

To inspect the protein sequence after a deletion,

.. code:: bash
$ transvar canno --ccds -i 'CCDS8856:c.769_771delGGG' --print-protein-pretty
::

Promoter region
##################

@@ -497,8 +460,8 @@ output a splice variation

::

chr7:5568790A>G CCDS5341 (protein_coding) ACTB -
chr7:g.5568790A>G/c.363+2T>C/. inside_[intron_between_exon_2_and_3]
chr7:5568790A>G CCDS5341 (protein_coding) ACTB -
chr7:g.5568790A>G/c.363+2T>C/. inside_[intron_between_exon_2_and_3]
CSQN=SpliceDonorSNV;C2=SpliceDonorOfExon2_At_chr7:5568791;source=CCDS


1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -18,6 +18,7 @@ Contents:
annotation_from_protein_level
annotation_from_cdna_level
interpret_variant_consequence
inspect_variants
faq
features
license
165 changes: 165 additions & 0 deletions docs/source/inspect_variants.rst
@@ -0,0 +1,165 @@
***************************
Inspect variant sequences
***************************

The `--print-protein` and `--print-protein-pretty` options display the full variant protein sequence in the `variant_protein_seq` field of the info column when the genomic variant hits a protein-coding transcript.

Missense substitution
#######################

.. code:: bash
$ transvar ganno -i 'chr1:g.115256530G>A' --ensembl --print-protein
::

chr1:g.115256530G>A ENST00000369535 (protein_coding) NRAS -
chr1:g.115256530G>A/c.181C>T/p.Q61* inside_[cds_in_exon_3]
CSQN=Nonsense;variant_protein_seq=MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQ
VVIDGETCLLDILDTAG*;codon_pos=115256528-115256529-115256530;ref_codon_seq=CAA;
aliases=ENSP00000358548;source=Ensembl
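
When post-processing this output, the `variant_protein_seq` value can be read from the semicolon-separated info column. The following is a minimal sketch, not part of TransVar: `parse_info` is a hypothetical helper, and it assumes the info column has been joined back into a single line (the output above is wrapped for display) and that no value contains a semicolon.

```python
def parse_info(info):
    """Split a TransVar-style info string into a dict (hypothetical helper)."""
    fields = {}
    for item in info.split(';'):
        if not item:
            continue
        # partition() splits on the first '=' only, so values may contain '='.
        key, sep, value = item.partition('=')
        # Entries without '=' are kept as flags with an empty value.
        fields[key] = value if sep else ''
    return fields


# Info column from the example above, with the display line breaks removed.
info = ("CSQN=Nonsense;variant_protein_seq=MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQ"
        "VVIDGETCLLDILDTAG*;codon_pos=115256528-115256529-115256530;"
        "ref_codon_seq=CAA;aliases=ENSP00000358548;source=Ensembl")
fields = parse_info(info)
print(fields['variant_protein_seq'])
```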

`--print-protein-pretty` output is more human-readable and highlights the mutation in brackets.

.. code:: bash
$ transvar ganno --ccds -i 'chr3:g.178936091G>A' --print-protein-pretty
::

chr3:g.178936091G>A CCDS43171 (protein_coding) PIK3CA +
chr3:g.178936091G>A/c.1633G>A/p.E545K inside_[cds_in_exon_9]
CSQN=Missense;dbsnp=rs104886003(chr3:178936091G>A);variant_protein_seq=MPPRPS
SGELWGIHLMPPRILVECLLPNGMIVTLECLREATLITIKHELFKEARKYPLHQLLQDESSYIFVSVTQEAEREEFF
DETRRLCDLRLFQPFLKVIEPVGNREEKILNREIGFAIGMPVCEFDMVKDPEVQDFRRNILNVCKEAVDLRDLNSPH
SRAMYVYPPNVESSPELPKHIYNKLDKGQIIVVIWVIVSPNNDKQKYTLKINHDCVPEQVIAEAIRKKTRSMLLSSE
QLKLCVLEYQGKYILKVCGCDEYFLEKYPLSQYKYIRSCIMLGRMPNLMLMAKESLYSQLPMDCFTMPSYSRRISTA
TPYMNGETSTKSLWVINSALRIKILCATYVNVNIRDIDKIYVRTGIYHGGEPLCDNVNTQRVPCSNPRWNEWLNYDI
YIPDLPRAARLCLSICSVKGRKGAKEEHCPLAWGNINLFDYTDTLVSGKMALNLWPVPHGLEDLLNPIGVTGSNPNK
ETPCLELEFDWFSSVVKFPDMSVIEEHANWSVSREAGFSYSHAGLSNRLARDNELRENDKEQLKAISTRDPLSEIT_
_[E>K]__QEKDFLWSHRHYCVTIPEILPKLLLSVKWNSRDEVAQMYCLVKDWPPIKPEQAMELLDCNYPDPMVRGF
AVRCLEKYLTDDKLSQYLIQLVQVLKYEQYLDNLLVRFLLKKALTNQRIGHFFFWHLKSEMHNKTVSQRFGLLLESY
CRACGMYLKHLNRQVEAMEKLINLTDILKQEKKDETQKVQMKFLVEQMRRPDFMDALQGFLSPLNPAHQLGNLRLEE
CRIMSSAKRPLWLNWENPDIMSELLFQNNEIIFKNGDDLRQDMLTLQIIRIMENIWQNQGLDLRMLPYGCLSIGDCV
GLIEVVRNSHTIMQIQCKGGLKGALQFNSHTLHQWLKDKNKGEIYDAAIDLFTRSCAGYCVATFILGIGDRHNSNIM
VKDDGQLFHIDFGHFLDHKKKKFGYKRERVPFVLTQDFLIVISKGAQECTKTREFERFQEMCYKAYLAIRQHANLFI
NLFSMMLGSGMPELQSFDDIAYIRKTLALDKTEQEALEYFMKQMNDAHHGGWTTKMDWIFHTIKQHALN*;codon_
pos=178936091-178936092-178936093;ref_codon_seq=GAG;source=CCDS
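
The bracket marker in the output above (`__[E>K]__`) can be reproduced with a few lines of string slicing. This is only an illustrative sketch of the marker layout, assuming a 1-based residue position; `pretty_substitution` is a hypothetical name and not TransVar's own function.

```python
def pretty_substitution(protein, pos, alt):
    """Mark a single-residue substitution at 1-based position `pos`."""
    ref = protein[pos - 1]
    # Keep everything before and after the residue, and put REF>ALT in brackets.
    return '%s__[%s>%s]__%s' % (protein[:pos - 1], ref, alt, protein[pos:])


# A short fragment around residue 545 of the PIK3CA example, renumbered from 1.
print(pretty_substitution('SEITEQEKD', 5, 'K'))
```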


The alphabet transformation option `--aa3` applies here as well.

.. code:: bash
$ transvar ganno -i 'chr1:g.115256530G>A' --ensembl --print-protein-pretty --aa3
::

chr1:g.115256530G>A ENST00000369535 (protein_coding) NRAS -
chr1:g.115256530G>A/c.181C>T/p.Gln61X inside_[cds_in_exon_3]
CSQN=Missense;variant_protein_seq=MetThrGluTyrLysLeuValValValGlyAlaGlyGlyValG
lyLysSerAlaLeuThrIleGlnLeuIleGlnAsnHisPheValAspGluTyrAspProThrIleGluAspSerTyr
ArgLysGlnValValIleAspGlyGluThrCysLeuLeuAspIleLeuAspThrAlaGly__[GluGluTyrSerAl
aMetArgAspGlnTyrMetArgThrGlyGluGlyPheLeuCysValPheAlaIleAsnAsnSerLysSerPheAlaA
spIleAsnLeuTyrArgGluGlnIleLysArgValLysAspSerAspAspValProMetValLeuValGlyAsnLys
CysAspLeuProThrArgThrValAspThrLysGlnAlaHisGluLeuAlaLysSerTyrGlyIleProPheIleGl
uThrSerAlaLysThrArgGlnGlyValGluAspAlaPheTyrThrLeuValArgGluIleArgGlnTyrArgMetL
ysLysLeuAsnSerSerAspAspGlyThrGlnGlyCysMetGlyLeuProCysValValMet>X];codon_pos=1
15256528-115256529-115256530;ref_codon_seq=CAA;aliases=ENSP00000358548;source
=Ensembl

Deletion
############

.. code:: bash
$ transvar canno --ccds -i 'CCDS8856:c.769_771delGGG' --print-protein-pretty
::

CCDS8856:c.769_771delGGG CCDS8856 (protein_coding) AAAS -
chr12:g.53703427_53703429delCCC/c.769_771delGGG/p.G257delG inside_[cds_in_exon_8]
CSQN=InFrameDeletion;left_align_gDNA=g.53703424_53703426delCCC;unaligned_gDNA
=g.53703424_53703426delCCC;left_align_cDNA=c.766_768delGGG;unalign_cDNA=c.769
_771delGGG;left_align_protein=p.G256delG;unalign_protein=p.G257delG;variant_p
rotein_seq=MCSLGLFPPPPPRGQVTLYEHNNELVTGSSYESPPPDFRGQWINLPVLQLTKDPLKTPGRLDHGTR
TAFIHHREQVWKRCINIWRDVGLFGVLNEIANSEEEVFEWVKTASGWALALCRWASSLHGSLFPHLSLRSEDLIAEF
AQVTNWSSCCLRVFAWHPHTNKFAVALLDDSVRVYNASSTIVPSLKHRLQRNVASLAWKPLSASVLAVACQSCILIW
TLDPTSLSTRPSSGCAQVLSHPGHTPVTSLAWAPSG__[G_deletion]__RLLSASPVDAAIRVWDVSTETCVPL
PWFRGGGVTNLLWSPDGSKILATTPSAVFRVWEAQMWTCERWPTLSGRCQTGCWSPDGSRLLFTVLGEPLIYSLSFP
ERCGEGKGCVGGAKSATIVADLSETTIQTPDGEERLGGEAHSMVWDPSGERLAVLMKGKPRVQDGKPVILLFRTRNS
PVFELLPCGIIQGEPGAQPQLITFHPSFNKGALLSVGWSTGRIAHIPLYFVNAQFPRFSPVLGRAQEPPAGGGGSIH
DLPLFTETSPTSAPWDPLPGPPPVLPHSPHSHL*;source=CCDS
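
The `__[G_deletion]__` marker follows the logic of the `variant_protein_sequence_deletion` function that this commit moves out of `transvar/deletion.py`. A simplified standalone sketch of that logic, assuming 1-based inclusive amino-acid coordinates and one-letter codes (`pretty_deletion` is a hypothetical name):

```python
def pretty_deletion(protein, taa_beg, taa_end):
    """Replace residues taa_beg..taa_end (1-based, inclusive) with a marker."""
    residues = list(protein)
    deleted = ''.join(residues[taa_beg - 1:taa_end])
    del residues[taa_beg - 1:taa_end]
    # Put the marker where the deleted residues used to be.
    residues.insert(taa_beg - 1, '__[%s_deletion]__' % deleted)
    return ''.join(residues)


# A short fragment around the deleted glycine of the AAAS example, renumbered from 1.
print(pretty_deletion('APSGGRLL', 5, 5))
```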

Insertion
############

.. code:: bash
$ transvar ganno -i 'chr2:g.69741762_69741763insTGC' --ccds --print-protein-pretty
::

chr2:g.69741762_69741763insTGC CCDS1893 (protein_coding) AAK1 -
chr2:g.69741780_69741782dupCTG/c.1614_1616dupGCA/p.Q546dupQ inside_[cds_in_exon_12]
CSQN=InFrameInsertion;left_align_gDNA=g.69741762_69741763insTGC;unalign_gDNA=
g.69741762_69741763insTGC;left_align_cDNA=c.1596_1597insCAG;unalign_cDNA=c.16
14_1616dupGCA;left_align_protein=p.Y532_Q533insQ;unalign_protein=p.Q539dupQ;v
ariant_protein_seq=MKKFFDSRREQGGSGLGSGSSGGGGSTSGLGSGYIGRVFGIGRQQVTVDEVLAEGGFA
IVFLVRTSNGMKCALKRMFVNNEHDLQVCKREIQIMRDLSGHKNIVGYIDSSINNVSSGDVWEVLILMDFCRGGQVV
NLMNQRLQTGFTENEVLQIFCDTCEAVARLHQCKTPIIHRDLKVENILLHDRGHYVLCDFGSATNKFQNPQTEGVNA
VEDEIKKYTTLSYRAPEMVNLYSGKIITTKADIWALGCLLYKLCYFTLPFGESQVAICDGNFTIPDNSRYSQDMHCL
IRYMLEPDPDKRPDIYQVSYFSFKLLKKECPIPNVQNSPIPAKLPEPVKASEAAAKKTQPKARLTDPIPTTETSIAP
RQRPKAGQTQPNPGILPIQPALTPRKRATVQPPPQAAGSSNQPGLLASVPQPKPQAPPSQPLPQTQAKQPQAPPTPQ
QTPSTQAQGLPAQAQATPQHQQQLFLKQQQQQQQPPPAQQQPAGTFYQQQQAQTQQFQAVHPATQKPAIAQFPVVSQ
GGSQQQLMQNFYQQQQQQQQQQQQQQ__[insert_Q]__LATALHQQQLMTQQAALQQKPTMAAGQQPQPQPAAAP
QPAPAQEPAIQAPVRQQPKVQTTPPPAVQGQKVGSLTPPSSPKTQRAGHRRILSDVTHSAVFGVPASKSTQLLQAAA
AEASLNKSKSATTTPSGSPRTSQQNVYNPSEGSTWNPFDDDNFSKLTAEELLNKDFAKLGEGKHPEKLGGSAESLIP
GFQSTQGDAFATTSFSAGTAEKRKGGQTVDSGLPLLSVSDPFIPLQVPDAPEKLIEGLKSPDTSLLLPDLLPMTDPF
GSTSDAVIEKADVAVESLIPGLEPPVPQRLPSQTESVTSNRTDSLTGEDSLLDCSLLSNPTTDLLEEFAPTAISAPV
HKAAEDSNLISGFDVPEGSDKVAEDEFDPIPVLITKNPQGGHSRNSSGSSESSLPNLARSLLLVDQLIDL*;phase
=2;source=CCDS
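
For insertions, the marker (`__[insert_Q]__`) sits between the residues that flank the inserted sequence. A minimal sketch of that layout, assuming the insertion is placed after a 1-based residue position; `pretty_insertion` is a hypothetical name, not a TransVar function:

```python
def pretty_insertion(protein, pos, inserted):
    """Mark `inserted` residues placed after 1-based position `pos`."""
    return '%s__[insert_%s]__%s' % (protein[:pos], inserted, protein[pos:])


# A short fragment around the duplicated glutamine of the AAK1 example.
print(pretty_insertion('QQQLATA', 3, 'Q'))
```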


Frameshift sequence
######################

.. code:: bash
$ transvar canno --ccds -i 'CCDS8856:c.769_770delGG' --print-protein-pretty
::

CCDS8856:c.769_770delGG CCDS8856 (protein_coding) AAAS -
chr12:g.53703428_53703429delCC/c.770_771delGG/p.G257Afs*65 inside_[cds_in_exon_8]
CSQN=Frameshift;left_align_gDNA=g.53703424_53703425delCC;unaligned_gDNA=g.537
03425_53703426delCC;left_align_cDNA=c.766_767delGG;unalign_cDNA=c.769_770delG
G;variant_protein_seq=MCSLGLFPPPPPRGQVTLYEHNNELVTGSSYESPPPDFRGQWINLPVLQLTKDPL
KTPGRLDHGTRTAFIHHREQVWKRCINIWRDVGLFGVLNEIANSEEEVFEWVKTASGWALALCRWASSLHGSLFPHL
SLRSEDLIAEFAQVTNWSSCCLRVFAWHPHTNKFAVALLDDSVRVYNASSTIVPSLKHRLQRNVASLAWKPLSASVL
AVACQSCILIWTLDPTSLSTRPSSGCAQVLSHPGHTPVTSLAWAPSG__[frameshift_GRLLSASPVDAAIRVW
DVSTETCVPLPWFRGGGVTNLLWSPDGSKILATTPSAVFRVWEAQMWTCERWPTLSGRCQTGCWSPDGSRLLFTVLG
EPLIYSLSFPERCGEGKGCVGGAKSATIVADLSETTIQTPDGEERLGGEAHSMVWDPSGERLAVLMKGKPRVQDGKP
VILLFRTRNSPVFELLPCGIIQGEPGAQPQLITFHPSFNKGALLSVGWSTGRIAHIPLYFVNAQFPRFSPVLGRAQE
PPAGGGGSIHDLPLFTETSPTSAPWDPLPGPPPVLPHSPHSHL*>AAALSFTRGCCYPGMGCLNRDLCPPSLVPRRW
GDQPALVPRRQQNPGYHSFSCLSSLGGPDVDL*];source=CCDS


.. code:: bash
$ transvar canno -i 'CCDS54438:c.409_421del' --ccds --print-protein-pretty
::

CCDS54438:c.409_421del CCDS54438 (protein_coding) ATG16L1 +
chr2:g.234183368_234183380del13/c.409_421del13/p.T137Lfs*5 inside_[cds_in_exon_5]
CSQN=Frameshift;left_align_gDNA=g.234183367_234183379del13;unaligned_gDNA=g.2
34183368_234183380del13;left_align_cDNA=c.408_420del13;unalign_cDNA=c.409_421
del13;variant_protein_seq=MSSGLRAADFPRWKRHISEQLRRRDRLQRQAFEEIILQYNKLLEKSDLHSV
LAQKLQAEKHDVPNRHEIRRRQARLQKELAEAAKEPLPVEQDDDIEVIVDETSDHTEETSPVRAISRAATRRSVSSF
PVPQDNVD__[frameshift_THPGSGKEVRVPATALCVFDAHDGEVNAVQFSPGSRLLATGGMDRRVKLWEVFGE
KCEFKGSLSGSNAGITSIEFDSAGSYLLAASNDFASRIWTVDDYRLRHTLTGHSGKVLSAKFLLDNARIVSGSHDRT
LKLWDLRSKVCIKTVFAGSSCNDIVCTEQCVMSGHFDKKIRFWDIRSESIVREMELLGKITALDLNPERTELLSCSR
DDLLKVIDLRTNAIKQTFSAPGFKCGSDWTRVVFSPDGSYVAAGSAEGSLYIWSVLTGKVEKVLSKQHSSSINAVAW
SPSGSHVVSVDKGCKAVLWAQY*>LVKK*];source=CCDS
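
In the frameshift outputs above, the marker has the form `__[frameshift_OLD>NEW]`, where `OLD` is the reference tail from the first changed residue through the stop codon and `NEW` is the shifted tail through its new stop. A minimal sketch of that layout, assuming a 1-based position for the first changed residue (`pretty_frameshift` is a hypothetical name, not TransVar's implementation):

```python
def pretty_frameshift(protein, pos, new_tail):
    """Mark a frameshift starting at 1-based position `pos`.

    `protein` is the reference sequence including the trailing '*';
    `new_tail` is the shifted sequence from `pos` through the new stop.
    """
    unchanged = protein[:pos - 1]
    old_tail = protein[pos - 1:]
    return '%s__[frameshift_%s>%s]' % (unchanged, old_tail, new_tail)


# Toy reference 'QDNVDTHPG*' with the frame shifting at residue 6.
print(pretty_frameshift('QDNVDTHPG*', 6, 'LVKK*'))
```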

5 changes: 4 additions & 1 deletion setup.py
@@ -1,6 +1,9 @@
#!/usr/bin/env python

# The MIT License
#
# Copyright (c) 2016
# Wanding Zhou
#
# Copyright (c) 2014, 2015 The University of Texas MD Anderson Cancer Center
# Wanding Zhou, Tenghui Chen and Ken Chen
@@ -25,7 +28,7 @@
# CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
#
# Contact: Ken Chen <[email protected]>
# Contact: Wanding Zhou <[email protected]>

import os
import sys
7 changes: 7 additions & 0 deletions test/test.sh
@@ -25,6 +25,13 @@ colordiff testout/ganno_tamborero.vcf

transvar ganno -l data/tamborero_data/transvar_dna_input.txt --ensembl | tee testout/ganno_tamborero_output

## upload github

modify transvar/version.py
git commit -am "this version"
git tag -a v[version] -m "version [version]"
git push --tag

## register testpypi

python setup.py register -r https://testpypi.python.org/pypi
28 changes: 8 additions & 20 deletions transvar/deletion.py
@@ -24,14 +24,14 @@
ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
"""

from transcripts import *
from utils import *
from record import *
from copy import copy
from describe import *
from proteinseqs import *

# TODO: refactor left-right align
class GNucDeletion():
@@ -315,28 +315,16 @@ def taa_del_id(t, taa_beg, taa_end, args):

return s

def variant_protein_sequence_deletion(r, t, args, taa_beg, taa_end):

    if args.pp or args.ppp:
        pp = list(aaf(t.get_proteinseq(), args, use_list=True))
        if args.pp:
            del pp[taa_beg-1:taa_end]
        elif args.ppp:
            delseq = ''.join(pp[taa_beg-1:taa_end])
            del pp[taa_beg-1:taa_end]
            pp.insert(taa_beg-1, '__[%s_deletion]__' % delseq)
        r.append_info('variant_protein_seq=%s' % ''.join(pp))

def taa_set_del(r, t, taa_beg, taa_end, args):

i1r, i2r = t.taa_roll_right_del(taa_beg, taa_end)
r.taa_range = taa_del_id(t, i1r, i2r, args)
i1l, i2l = t.taa_roll_left_del(taa_beg, taa_end)
r.append_info('left_align_protein=p.%s' %
taa_del_id(t, i1l, i2l, args))
r.append_info('unalign_protein=p.%s' % taa_del_id(
t, taa_beg, taa_end, args))
variant_protein_sequence_deletion(r, t, args, i1r, i2r)
r.append_info('unalign_protein=p.%s' %
taa_del_id(t, taa_beg, taa_end, args))
variant_protein_seq_deletion(r, t, args, i1r, i2r)

def del_coding_inframe(args, c1, c2, p1, p2, t, r):

@@ -399,11 +387,11 @@ def del_coding_frameshift(args, cbeg, cend, pbeg, pend, t, r):
if not old_seq:
raise IncompatibleTranscriptError("invalid_cDNA_position_%d;expect_[0_%d]" % (cbeg_beg, len(t.seq)))

ret = t.extend_taa_seq(cbeg.index, old_seq, new_seq)
if ret:
taa_pos, taa_ref, taa_alt, termlen = ret
r.taa_range = '%s%d%sfs*%d' % (aaf(taa_ref, args), taa_pos, aaf(taa_alt, args), termlen)
aae = t.extend_taa_seq(cbeg.index, old_seq, new_seq)
if aae:
r.taa_range = aae.format(args)
r.csqn.append("Frameshift")
variant_protein_seq_fs(r, t, aae, args)
else: # rare chance when stop codon seen before difference
r.taa_range = '(=)'
r.csqn.append("Synonymous")
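
The hunk above replaces the 4-tuple returned by `extend_taa_seq` with an object (`aae`) that exposes `taa_pos`, `taa_ref`, `taa_alt` and `termlen` and formats itself. The class itself is not shown in this diff, so the following is only a sketch of what such an object might look like. The class name `AAExtension`, the boolean `aa3` parameter (the real `format` takes `args`) and the tiny three-letter table are all assumptions; the format string is taken from the removed line.

```python
class AAExtension(object):
    """Hypothetical stand-in for the object returned by extend_taa_seq."""

    # Small illustrative subset of the one-to-three-letter amino acid table.
    AA3 = {'G': 'Gly', 'A': 'Ala', 'T': 'Thr', 'L': 'Leu'}

    def __init__(self, taa_pos, taa_ref, taa_alt, termlen):
        self.taa_pos = taa_pos    # position of the first changed residue
        self.taa_ref = taa_ref    # reference residue at that position
        self.taa_alt = taa_alt    # new residue at that position
        self.termlen = termlen    # distance to the new stop codon

    def format(self, aa3=False):
        """Render the frameshift, e.g. 'G257Afs*65'."""
        ref, alt = self.taa_ref, self.taa_alt
        if aa3:
            ref, alt = self.AA3[ref], self.AA3[alt]
        return '%s%d%sfs*%d' % (ref, self.taa_pos, alt, self.termlen)


aae = AAExtension(257, 'G', 'A', 65)
print(aae.format())
```

Bundling the four values into one object lets callers such as `fuzzy_match_deletion` refer to fields by name rather than unpacking a tuple in a fixed order.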
32 changes: 14 additions & 18 deletions transvar/frameshift.py
@@ -46,13 +46,10 @@ def fuzzy_match_deletion(t, codon, q, args):
old_seq = t.seq[jb:]
new_seq = t.seq[jb:j]+t.seq[j+ds:]
tnuc_delseq = t.seq[j:j+ds]
ret = t.extend_taa_seq(j/3+1, old_seq, new_seq)
if ret:
_taa_pos, _taa_ref, _taa_alt, _termlen = ret
# print q.pos, q.ref, q.alt, q.stop_index, j, _taa_pos, _taa_ref, _taa_alt, _termlen
# print q.pos, q.ref, q.alt, type(q.stop_index), j, type(_taa_pos), type(_taa_ref), type(_taa_alt), type(_termlen)
if (q.ref == _taa_ref and ((not q.alt) or q.alt == _taa_alt)
and q.stop_index == _termlen and q.pos == _taa_pos):
aae = t.extend_taa_seq(j/3+1, old_seq, new_seq)
if aae:
if (q.ref == aae.taa_ref and ((not q.alt) or q.alt == aae.taa_alt)
and q.stop_index == aae.termlen and q.pos == aae.taa_pos):
t.ensure_position_array()
if t.strand == '+':
gnuc_beg, gnuc_end = t.np[j], t.np[j+ds-1]
@@ -117,17 +114,16 @@ def fuzzy_match_insertion_aa_change(t, j, ins_len, q):
for _insseq in itertools.product(alphabets, repeat=ins_len):
insseq = ''.join(_insseq)
new_seq = t.seq[jb:j]+insseq+t.seq[j:]
ret = t.extend_taa_seq(j/3+1, old_seq, new_seq)
if ret:
_taa_pos, _taa_ref, _taa_alt, _termlen = ret
if (((not q.alt) or _taa_alt == q.alt) and q.ref == _taa_ref
and q.pos == _taa_pos and _termlen <= q.stop_index):
m = FuzzyInsMatch()
m.insseq = insseq
m.termlen = _termlen
match_seq.append(m)
if _termlen == q.stop_index:
termlen_match = True
aae = t.extend_taa_seq(j/3+1, old_seq, new_seq)
if (aae and
(((not q.alt) or aae.taa_alt == q.alt) and q.ref == aae.taa_ref
and q.pos == aae.taa_pos and aae.termlen <= q.stop_index)):
m = FuzzyInsMatch()
m.insseq = insseq
m.termlen = aae.termlen
match_seq.append(m)
if m.termlen == q.stop_index:
termlen_match = True

# early stop when termlen also matches
if termlen_match: