BioReader
Wang Yunfei edited this page Feb 9, 2017
·
3 revisions
- BioReader is a universal wrapper around all the other Readers and Iterators. If no ftype is provided (ftype=None), it returns each line as a vector of strings.
- Converters: For simple file formats in which each line represents one object, such as Bed and GenePred, a converter turns each line into an object. Converters are registered in the variable converters.
- Readers: For complex file formats, a reader parses the file and returns objects iteratively. Readers are registered in the variable readers.
- Users can add new Converters and Readers to the IO class.
Definition of BioReader class:
def BioReader(infile, ftype=None, **kwargs):
    '''
    Read most of the iterative biological files into Python objects.
    Usage:
        for item in IO.BioReader(infile, sep="\t", skip=10, mask="#"):
            print item
    Parameters:
        ftype: default(None), Bed, Gene/GeneBed/Tab/GenePred, wig/bigwig and sam/bam. Converts line(s) into an object or a vector of strings.
        sep: character that separates the columns, default("\t").
        skip: numeric, default 0. Skip the first n lines.
        mask: character, default "#". Mask lines that start with "#".
    Output:
        By default, it returns a vector of strings split by 'sep'.
        For any text file that has at least chrom, start and stop, the 'ftype' can be provided as the indices of the Bed elements, such as "2:3:1:-:4:6". "-" indicates missing values.
        Other file types are converted to Python objects such as Bed, GeneBed and SAM.
        NOTE: Set "ftype=None" for any text files if you don't want to convert them to objects, such as SAM, GTF and GFF format.
    '''
    converters={ None:lambda x:x,'any':lambda x: IO.AnyToBed(x,fidx),'bed':Bed, 'gene':GeneBed, 'tab':GeneBed, 'genepred':GeneBed}
    # converters.extend( {'bowtie':BowtieToBed,'soap':SOAPToBed} )
    readers={'fasta': lambda x: IO.SeqReader(x,'fasta'),'fastq':lambda x: IO.SeqReader(x,'fastq'),'wig':lambda x: IO.WigReader(x,'wig')}
    # readers.extend( {'gff':IO.GFFReader,'gtf':IO.GTFReader} )
    ......
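The split between converters (one line in, one object out) and readers (a whole line stream in, objects out) can be illustrated with a small stand-alone sketch. It is not the ngslib implementation, whose body is elided above. `Interval`, `to_interval`, `pair_reader`, `CONVERTERS`, `READERS` and `bio_reader_sketch` are all hypothetical names, and the sketch takes an iterable of lines rather than a file name so that it runs without the library.

```python
from collections import namedtuple

# Hypothetical stand-in for a Bed-like object; not the ngslib Bed class.
Interval = namedtuple("Interval", "chrom start stop")


def to_interval(fields):
    """Converter: one list of column strings -> one object."""
    return Interval(fields[0], int(fields[1]), int(fields[2]))


def pair_reader(lines):
    """Reader: consumes two lines per record and yields (header, body) tuples."""
    lines = iter(lines)
    for header in lines:
        body = next(lines, "")
        yield (header.rstrip("\n"), body.rstrip("\n"))


# Registries keyed by file type, mirroring the idea of `converters` and `readers`.
CONVERTERS = {None: lambda fields: fields, "interval": to_interval}
READERS = {"pair": pair_reader}


def bio_reader_sketch(lines, ftype=None, sep="\t"):
    """Dispatch to a reader if one is registered, otherwise convert line by line."""
    key = ftype.lower() if isinstance(ftype, str) else ftype
    if key in READERS:
        # Multi-line formats: hand the whole line stream to the reader.
        for record in READERS[key](lines):
            yield record
    elif key in CONVERTERS:
        convert = CONVERTERS[key]
        for line in lines:
            yield convert(line.rstrip("\n").split(sep))
    else:
        raise TypeError("%s format is not supported." % ftype)
```

With this layout, supporting a new format only means adding an entry to one of the two dictionaries (for example with `dict.update`); the dispatch function itself does not change.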
- For simple table files, no ftype needs to be specified. BioReader just converts each line to a string vector.
- Users can specify the sep, skip and mask values.
- Example: For a file whose comment lines start with "#" and whose columns are separated by "\t", we want to skip the first 10 records:
from ngslib import IO
for line in IO.BioReader(infile, sep="\t", skip=10, mask='#'):
    pass  # do something with line
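The effect of the sep, skip and mask options can be sketched without the library. The function below, `iter_rows`, is a hypothetical stand-in and not ngslib code. It assumes that masked lines are dropped first and that skip then counts the remaining records; the library may count raw lines instead.

```python
def iter_rows(lines, sep="\t", skip=0, mask="#"):
    """Yield each unmasked line as a list of strings.

    Assumption: masked lines are dropped first, then `skip` counts the
    remaining records. The real BioReader may order these steps differently.
    """
    skipped = 0
    for line in lines:
        if mask and line.startswith(mask):
            continue  # comment line, ignored entirely
        if skipped < skip:
            skipped += 1
            continue  # one of the first `skip` records
        yield line.rstrip("\r\n").split(sep)


if __name__ == "__main__":
    data = ["#header\n", "a\t1\n", "b\t2\n", "c\t3\n"]
    for row in iter_rows(data, skip=1):
        print(row)
```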
- Converters are a group of functions that accept a string vector as input and convert it into an object.
- For simple formats such as Bed and GeneBed, in which each line holds one object, we can convert line by line.
- For uncommon formats such as SOAP and Bowtie, a converter saves us from writing a complex Reader.
- For any text file that contains at least chrom, start and end on each line, we can convert each line to Bed format.
- Example: we can convert a SOAP mapping result into Bed objects like this:
from ngslib import IO
for tbed in IO.BioReader(infile, ftype='soap'):
    print tbed
- Example: For any text file that contains chrom, start and end information, like this:
#bin  name          chrom  strand  txStart    txEnd      cdsStart   cdsEnd
0     NM_001195025  chr1   +       134212701  134230065  134212806  134228958
0     NM_028778     chr1   +       134212701  134230065  134212806  134228958
1     NM_008922     chr1   -       33510655   33726603   33510930   33725856
We can convert it to Bed objects by giving the 1-based column indices of the Bed elements in the order chrom, start, end, name, score, strand. "-" indicates a missing value, and trailing "-"s can be omitted.
Input: without score information
from ngslib import IO
for tbed in IO.BioReader(infile, ftype='3:5:6:2:-:4', mask="#"): # no score information
    print tbed
Output: scores are set to 0.00 by default.
chr1  134212701  134230065  NM_001195025  0.00  +
chr1  134212701  134230065  NM_028778     0.00  +
chr1  33510655   33726603   NM_008922     0.00  -
Input: without score and strand information
from ngslib import IO
for tbed in IO.BioReader(infile, ftype='3:5:6:2', mask="#"): # no score and strand information
    print tbed
Output: score and strand are set to their default values.
chr1  134212701  134230065  NM_001195025  0.00  .
chr1  134212701  134230065  NM_028778     0.00  .
chr1  33510655   33726603   NM_008922     0.00  .
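How an index string such as '3:5:6:2:-:4' could be mapped onto Bed fields is sketched below. This is an illustration only, not the code behind IO.AnyToBed. The helper names are hypothetical, and the "." default for a missing name is an assumption; the 0.00 score and "." strand defaults are taken from the outputs shown above.

```python
def parse_index_spec(spec):
    """'3:5:6:2:-:4' -> [2, 4, 5, 1, None, 3] (0-based; None marks a missing field)."""
    return [None if tok == "-" else int(tok) - 1 for tok in spec.split(":")]


def row_to_bed_fields(row, spec):
    """Pick chrom, start, stop, name, score, strand out of a list of column strings."""
    idx = parse_index_spec(spec)
    idx += [None] * (6 - len(idx))  # trailing "-" entries may be omitted
    chrom, start, stop, name, score, strand = [
        row[i] if i is not None else None for i in idx[:6]
    ]
    return (
        chrom,
        int(start),
        int(stop),
        name if name is not None else ".",            # assumed default
        float(score) if score is not None else 0.0,   # default seen in the output above
        strand if strand is not None else ".",        # default seen in the output above
    )


def format_bed(fields):
    """Render the six fields as one tab-separated line."""
    return "%s\t%d\t%d\t%s\t%.2f\t%s" % fields


if __name__ == "__main__":
    row = "0 NM_001195025 chr1 + 134212701 134230065 134212806 134228958".split()
    print(format_bed(row_to_bed_fields(row, "3:5:6:2:-:4")))
    print(format_bed(row_to_bed_fields(row, "3:5:6:2")))
```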
- Readers are a group of functions that parse specific file formats.
- Readers accept infile and ftype as input and yield objects from the file iteratively.
- Readers can be written in C/C++ as static libraries, as long as a Python wrapper is provided.
- Example: Given a FASTA file (mm9_miRNA.fa) like this:
>MIMAT0025084_1;mmu-miR-6341;MI0021869_1
AGTGCAATGATATTGTCACTAT
>MIMAT0017004_1;mmu-miR-206-5p;MI0000249_1
CATGCTTCTTTATATCCTCATA
>MIMAT0000239_1;mmu-miR-206-3p;MI0000249_1
GGAATGTAAGGAAGTGTGTGG
We will calculate the GC content of each sequence. Note that here BioReader returns a FASTA object for each record.
from ngslib import IO,Utils
for tseq in IO.BioReader("mm9_miRNA.fa",'fasta'):
    print tseq.id+"\t"+str(round(Utils.GCContent(tseq.seq),2))
Output:
MIMAT0025084_1;mmu-miR-6341;MI0021869_1     0.32
MIMAT0017004_1;mmu-miR-206-5p;MI0000249_1   0.32
MIMAT0000239_1;mmu-miR-206-3p;MI0000249_1   0.48
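The numbers above can be checked by hand with a few lines of plain Python. The function `gc_content` below is a hypothetical stand-in that assumes Utils.GCContent returns the fraction of G and C bases in the sequence; that assumption is consistent with the output shown, but the library function may differ in details such as case or ambiguity-code handling.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / float(len(seq))


if __name__ == "__main__":
    records = [
        ("MIMAT0025084_1;mmu-miR-6341;MI0021869_1", "AGTGCAATGATATTGTCACTAT"),
        ("MIMAT0017004_1;mmu-miR-206-5p;MI0000249_1", "CATGCTTCTTTATATCCTCATA"),
        ("MIMAT0000239_1;mmu-miR-206-3p;MI0000249_1", "GGAATGTAAGGAAGTGTGTGG"),
    ]
    for name, seq in records:
        print(name + "\t" + str(round(gc_content(seq), 2)))
```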
- Source code of the SeqReader function. Here we use a generator to yield objects iteratively.
def SeqReader(infile,ftype='fasta'):
    '''Read sequence files.'''
    ftype=ftype.lower()
    #Read lines
    with IO.mopen(infile) as fh:
        if ftype=='fasta':
            line = fh.next()
            if line[0] != ">":
                raise ValueError("Records in Fasta files should start with '>' character")
            line = line.lstrip('>').rstrip().replace('\t',' ').split(' ')
            name = line[0]
            desc = ' '.join(line[1:])
            seq = ''
            while True:
                try:
                    line = fh.next()
                except:
                    if seq != '':
                        yield Fasta(name, seq, desc)
                    raise StopIteration
                if line[0] != ">":
                    seq += line.rstrip()
                else:
                    yield Fasta(name, seq, desc)
                    line = line.lstrip('>').rstrip().replace('\t',' ').split(' ')
                    name = line[0]
                    desc = ' '.join(line[1:])
                    seq = ''
        elif ftype=='fastq':
            while True:
                try:
                    fid=fh.next().rstrip().lstrip('@')
                    seq=fh.next().rstrip()
                    fh.next()
                    qual = fh.next().rstrip()
                    yield Fastq(fid,seq,qual)
                except:
                    raise StopIteration
        else:
            raise TypeError(ftype+" format is not supported.")
        assert False, "Do not reach this line."
SeqReader=staticmethod(SeqReader)
The user can define their own Readers in a similar way.
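As a starting point for a custom Reader, here is a minimal generator written for Python 3. It is a sketch under assumptions, not a drop-in replacement for SeqReader: `FastaRecord` and `fasta_reader` are hypothetical names, and the function takes any iterable of lines instead of a file name. The generator ends by simply returning, because since Python 3.7 a StopIteration raised inside a generator body is converted into a RuntimeError.

```python
from collections import namedtuple

# Hypothetical record type; the library's own Fasta class is not used here.
FastaRecord = namedtuple("FastaRecord", "id seq description")


def fasta_reader(lines):
    """Yield one FastaRecord per '>' header found in an iterable of lines."""
    name, desc, chunks = None, "", []
    for line in lines:
        line = line.rstrip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                # A new header closes the previous record.
                yield FastaRecord(name, "".join(chunks), desc)
            header = line[1:].replace("\t", " ").split(" ")
            name, desc, chunks = header[0], " ".join(header[1:]), []
        elif name is None:
            raise ValueError("Records in Fasta files should start with '>' character")
        else:
            chunks.append(line)
    if name is not None:
        # Emit the last record once the input is exhausted.
        yield FastaRecord(name, "".join(chunks), desc)


if __name__ == "__main__":
    demo = [">seq1 first record\n", "ACGT\n", "TTGA\n", ">seq2\n", "GGCC\n"]
    for rec in fasta_reader(demo):
        print(rec.id, rec.seq, rec.description)
```

Collecting sequence lines in a list and joining them once per record avoids repeated string concatenation on long sequences.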