A collection of python scripts (v3.5) used for P.patens sORF discovery work.

The corresponding MS is currently under revision and is available here: https://www.biorxiv.org/content/early/2017/11/03/213736

GffParser.py

Description: script to find introns and CDS coordinates in P.patens gff3 file. Usage: python3 GffParser.py

protein_Ka_Ks_codeml.py

Description: script to calculate dn/ds value. Basically, it takes two fasta files: proteins corresponding to sORF, transcript nucleotide sequences from other species. Usage: see --help. Example: python3 protein_Ka_Ks_codeml.py translated_FINAL_sORF_SELECTED.fa Zmays_284_Ensembl-18_2010-01-MaizeSequence.transcript.fa --blast T --threads 10 --makedb T

Dependencies:

perl, python, biopython
pal2nal perl script, available here https://github.com/HajkD/orthologr/tree/master/inst/pal2nal/pal2nal.v14. !!! set pal2nal variable to run pal2nal script e.g. pal2nal=r"perl ./pal2nal.v14/pal2nal.pl" (default)
blast. Please, if it is required, change blast variable in class e.g. blast=r"tblastn" (default)
codeml. .ctl file can be found in this directory
clustalo
codeml Parser module (it is located at codemlParser directory)

Change variable query_seq_ind to the path to the file with sORF nucleotide sequences e.g. (default) query_seq_ind = SeqIO.index("/home/ilia/sORF/sORFfinder2/moss/FINAL_sORF_SELECTED.fa", "fasta")

oopBLASTv_forsORF.py script. Available in this repository

!!! NOTE !!!It requires some variables have to be manually changed. See info in the script files for details. One of the variables corresponds to the fasta file containing nucleotide sequences of sORFs (Important! ids of the protein and nucleotide sORF sequences MUST be identical).

Scripts required for protein_Ka_Ks_codeml.py:
2.1 codemlParser folder contains module to run and parser Codeml program
2.2 oopBLASTv_forsORF.py - script to run blast and parse the results. Usage examples: class takes query and reference fasta files Bl = BlastParser(r"/home/ilia/sORF/sORFfinder2/moss/FINAL_sORF_SELECTED.fa", r"/home/ilia/SOLID_moss/Ppatens_318_v3_index.fa", DB_build=False)

run BLAST Bl.runblast()

the function takes xml file. Default name for this file is "vs". Bl.parseBlastXml() It can also be used to parse external xml file. E.g.

parseBlastXml(file_exo="extranal_xml_blast.xml")

It returns .bed file with following columns:

Query id
Hit id
Start HSP in hit (start position of hit sequence involved in alignment)
Stop HSP in hit (stop position of hit sequence involved in alignment)
E-value,
hit HSP sequence,
query HSP sequence
length of hit HSP
hit strand
Start HSP in query (start position of query sequence involved in alignment)
Stop HSP in query (stop position of query sequence involved in alignment)

KaKsloop.py

Descroption: example script which can be used to run protein_Ka_Ks_codeml.py for set of reference sequences

sORF_completeness_v2.0.py

Description: script to estimate changes in homologous sORF length between different species. It takes 1) genome/transcriptome fasta file, 2) fasta file with protein sequence translated from sORFs and 3) bed table created by oopBLASTv_forsORF.py script. The script return table with 18 column (actually the first 13 columns are identical to the input bed file):

Query id
Hit id
Start HSP in hit (start position of hit sequence involved in alignment)
Stop HSP in hit (stop position of hit sequence involved in alignment)
E-value,
hit HSP sequence,
query HSP sequence
length of hit HSP
hit strand
Blank column
Start HSP in query (start position of query sequence involved in alignment)
Stop HSP in query (stop position of query sequence involved in alignment)
Predicted start Codon coordinates of homologous sORF
Predicted stop Codon coordinates of homologous sORF
Premature Stop Codon (- no PSC found)
Predicted length of homologous sORF (if 0 – premature stop codon found before (upstream) HSP start), aa
sORF query length, aa

sORFfastaToBed2.py

Description: Script to parse sORFfinder output file to generate bed file Usage: positional arguments: infile name of sORF fasta file generated by sORFfinder genomFile name of target genome fasta file used for sORFfinder outputfileName number of threads optional arguments: -h, --help show this help message and exit It writes a table with following columns:

Chromosome name
Start position
End position
Strand
Coding index
sORF name

Name		Name	Last commit message	Last commit date
Latest commit History 35 Commits
codemlParser		codemlParser
.gitignore		.gitignore
GffParser.py		GffParser.py
KaKsloop.py		KaKsloop.py
README.md		README.md
Visualize_spectra_colored.R		Visualize_spectra_colored.R
codeml.ctl		codeml.ctl
oopBLASTv_forsORFv2.0.py		oopBLASTv_forsORFv2.0.py
protein_Ka_Ks_codeml.py		protein_Ka_Ks_codeml.py
sORF_completeness_v2.1.py		sORF_completeness_v2.1.py
sORFfastaToBed2_cmd.py		sORFfastaToBed2_cmd.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

A collection of python scripts (v3.5) used for P.patens sORF discovery work.

GffParser.py

protein_Ka_Ks_codeml.py

KaKsloop.py

sORF_completeness_v2.0.py

sORFfastaToBed2.py

About

Releases 1

Packages

Languages

Kirovez/Scripts_sORFs_MS

Folders and files

Latest commit

History

Repository files navigation

A collection of python scripts (v3.5) used for P.patens sORF discovery work.

GffParser.py

protein_Ka_Ks_codeml.py

KaKsloop.py

sORF_completeness_v2.0.py

sORFfastaToBed2.py

About

Resources

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages