Skip to content

Kirovez/Scripts_sORFs_MS

Repository files navigation

A collection of python scripts (v3.5) used for P.patens sORF discovery work.

The corresponding MS is currently under revision and is available here: https://www.biorxiv.org/content/early/2017/11/03/213736

GffParser.py

Description: script to find introns and CDS coordinates in P.patens gff3 file. Usage: python3 GffParser.py

protein_Ka_Ks_codeml.py

Description: script to calculate dn/ds value. Basically, it takes two fasta files: proteins corresponding to sORF, transcript nucleotide sequences from other species. Usage: see --help. Example: python3 protein_Ka_Ks_codeml.py translated_FINAL_sORF_SELECTED.fa Zmays_284_Ensembl-18_2010-01-MaizeSequence.transcript.fa --blast T --threads 10 --makedb T

Dependencies:

Change variable query_seq_ind to the path to the file with sORF nucleotide sequences e.g. (default) query_seq_ind = SeqIO.index("/home/ilia/sORF/sORFfinder2/moss/FINAL_sORF_SELECTED.fa", "fasta")

  • oopBLASTv_forsORF.py script. Available in this repository

!!! NOTE !!!It requires some variables have to be manually changed. See info in the script files for details. One of the variables corresponds to the fasta file containing nucleotide sequences of sORFs (Important! ids of the protein and nucleotide sORF sequences MUST be identical).

  • Scripts required for protein_Ka_Ks_codeml.py:
  • 2.1 codemlParser folder contains module to run and parser Codeml program
  • 2.2 oopBLASTv_forsORF.py - script to run blast and parse the results. Usage examples: class takes query and reference fasta files Bl = BlastParser(r"/home/ilia/sORF/sORFfinder2/moss/FINAL_sORF_SELECTED.fa", r"/home/ilia/SOLID_moss/Ppatens_318_v3_index.fa", DB_build=False)

run BLAST Bl.runblast()

the function takes xml file. Default name for this file is "vs". Bl.parseBlastXml() It can also be used to parse external xml file. E.g.

parseBlastXml(file_exo="extranal_xml_blast.xml")

It returns .bed file with following columns:

  1. Query id
  2. Hit id
  3. Start HSP in hit (start position of hit sequence involved in alignment)
  4. Stop HSP in hit (stop position of hit sequence involved in alignment)
  5. E-value,
  6. hit HSP sequence,
  7. query HSP sequence
  8. length of hit HSP
  9. hit strand
  10. Start HSP in query (start position of query sequence involved in alignment)
  11. Stop HSP in query (stop position of query sequence involved in alignment)

KaKsloop.py

Descroption: example script which can be used to run protein_Ka_Ks_codeml.py for set of reference sequences

sORF_completeness_v2.0.py

Description: script to estimate changes in homologous sORF length between different species. It takes 1) genome/transcriptome fasta file, 2) fasta file with protein sequence translated from sORFs and 3) bed table created by oopBLASTv_forsORF.py script. The script return table with 18 column (actually the first 13 columns are identical to the input bed file):

  1. Query id
  2. Hit id
  3. Start HSP in hit (start position of hit sequence involved in alignment)
  4. Stop HSP in hit (stop position of hit sequence involved in alignment)
  5. E-value,
  6. hit HSP sequence,
  7. query HSP sequence
  8. length of hit HSP
  9. hit strand
  10. Blank column
  11. Start HSP in query (start position of query sequence involved in alignment)
  12. Stop HSP in query (stop position of query sequence involved in alignment)
  13. Predicted start Codon coordinates of homologous sORF
  14. Predicted stop Codon coordinates of homologous sORF
  15. Premature Stop Codon (- no PSC found)
  16. Predicted length of homologous sORF (if 0 – premature stop codon found before (upstream) HSP start), aa
  17. sORF query length, aa

sORFfastaToBed2.py

Description: Script to parse sORFfinder output file to generate bed file Usage: positional arguments: infile name of sORF fasta file generated by sORFfinder genomFile name of target genome fasta file used for sORFfinder outputfileName number of threads optional arguments: -h, --help show this help message and exit It writes a table with following columns:

  1. Chromosome name
  2. Start position
  3. End position
  4. Strand
  5. Coding index
  6. sORF name