A collection of python scripts (v3.5) used for P.patens sORF discovery work.
The corresponding MS is currently under revision and is available here: https://www.biorxiv.org/content/early/2017/11/03/213736
Description: script to find introns and CDS coordinates in P.patens gff3 file. Usage: python3 GffParser.py
Description: script to calculate dn/ds value. Basically, it takes two fasta files: proteins corresponding to sORF, transcript nucleotide sequences from other species. Usage: see --help. Example: python3 protein_Ka_Ks_codeml.py translated_FINAL_sORF_SELECTED.fa Zmays_284_Ensembl-18_2010-01-MaizeSequence.transcript.fa --blast T --threads 10 --makedb T
Dependencies:
-
perl, python, biopython
-
pal2nal perl script, available here https://github.com/HajkD/orthologr/tree/master/inst/pal2nal/pal2nal.v14. !!! set
pal2nal
variable to run pal2nal script e.g.pal2nal=r"perl ./pal2nal.v14/pal2nal.pl"
(default) -
blast. Please, if it is required, change
blast
variable in class e.g.blast=r"tblastn"
(default) -
codeml. .ctl file can be found in this directory
-
clustalo
-
codeml Parser module (it is located at codemlParser directory)
Change variable query_seq_ind
to the path to the file with sORF nucleotide sequences
e.g. (default) query_seq_ind = SeqIO.index("/home/ilia/sORF/sORFfinder2/moss/FINAL_sORF_SELECTED.fa", "fasta")
- oopBLASTv_forsORF.py script. Available in this repository
!!! NOTE !!!It requires some variables have to be manually changed. See info in the script files for details. One of the variables corresponds to the fasta file containing nucleotide sequences of sORFs (Important! ids of the protein and nucleotide sORF sequences MUST be identical).
- Scripts required for protein_Ka_Ks_codeml.py:
- 2.1 codemlParser folder contains module to run and parser Codeml program
- 2.2 oopBLASTv_forsORF.py - script to run blast and parse the results.
Usage
examples:
class takes query and reference fasta files
Bl = BlastParser(r"/home/ilia/sORF/sORFfinder2/moss/FINAL_sORF_SELECTED.fa", r"/home/ilia/SOLID_moss/Ppatens_318_v3_index.fa", DB_build=False)
run BLAST
Bl.runblast()
the function takes xml file. Default name for this file is "vs".
Bl.parseBlastXml()
It can also be used to parse external xml file. E.g.
parseBlastXml(file_exo="extranal_xml_blast.xml")
It returns .bed file with following columns:
- Query id
- Hit id
- Start HSP in hit (start position of hit sequence involved in alignment)
- Stop HSP in hit (stop position of hit sequence involved in alignment)
- E-value,
- hit HSP sequence,
- query HSP sequence
- length of hit HSP
- hit strand
- Start HSP in query (start position of query sequence involved in alignment)
- Stop HSP in query (stop position of query sequence involved in alignment)
Descroption: example script which can be used to run protein_Ka_Ks_codeml.py for set of reference sequences
Description: script to estimate changes in homologous sORF length between different species. It takes 1) genome/transcriptome fasta file, 2) fasta file with protein sequence
translated from sORFs and 3) bed table created by oopBLASTv_forsORF.py
script.
The script
return table with 18 column (actually the first 13 columns are identical to the input bed file):
- Query id
- Hit id
- Start HSP in hit (start position of hit sequence involved in alignment)
- Stop HSP in hit (stop position of hit sequence involved in alignment)
- E-value,
- hit HSP sequence,
- query HSP sequence
- length of hit HSP
- hit strand
- Blank column
- Start HSP in query (start position of query sequence involved in alignment)
- Stop HSP in query (stop position of query sequence involved in alignment)
- Predicted start Codon coordinates of homologous sORF
- Predicted stop Codon coordinates of homologous sORF
- Premature Stop Codon (- no PSC found)
- Predicted length of homologous sORF (if 0 – premature stop codon found before (upstream) HSP start), aa
- sORF query length, aa
Description: Script to parse sORFfinder output file to generate bed file Usage: positional arguments: infile name of sORF fasta file generated by sORFfinder genomFile name of target genome fasta file used for sORFfinder outputfileName number of threads optional arguments: -h, --help show this help message and exit It writes a table with following columns:
- Chromosome name
- Start position
- End position
- Strand
- Coding index
- sORF name