Wang Yunfei edited this page Feb 9, 2017 · 3 revisions

BioReader

  • BioReader is a universal wrapper around all other Readers and Iterators. If no ftype is provided (ftype=None), it returns each line as a vector of strings.
  • Converters: For simple file formats in which each line is one object, such as Bed and GenePred, a converter turns each line into an object. Converters are registered in the variable converters.
  • Readers: For complex file formats, a reader parses the file and returns objects iteratively. Readers are registered in the variable readers.
  • Users can add new Converters and Readers to the IO class.
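The split between converters and readers described above can be sketched as a small dispatcher. This is a minimal illustration and not ngslib's code: `bio_reader_sketch`, `to_interval`, `pair_reader`, `CONVERTERS` and `READERS` are all hypothetical names, and the example works on an open file handle rather than a file name.

```python
import io

# Hypothetical converter: one split line in, one object out.
def to_interval(fields):
    return (fields[0], int(fields[1]), int(fields[2]))

# Hypothetical reader: consumes the handle itself and yields objects.
def pair_reader(handle):
    """Yield (header, body) tuples from a two-lines-per-record file."""
    while True:
        header = handle.readline()
        if not header:
            return
        body = handle.readline()
        yield (header.rstrip("\n"), body.rstrip("\n"))

CONVERTERS = {None: lambda fields: fields, "interval": to_interval}
READERS = {"pair": pair_reader}

def bio_reader_sketch(handle, ftype=None, sep="\t"):
    """Dispatch to a reader if one exists, else convert line by line."""
    key = ftype.lower() if isinstance(ftype, str) else ftype
    if key in READERS:
        # Complex formats: the reader drives the iteration.
        for obj in READERS[key](handle):
            yield obj
        return
    if key not in CONVERTERS:
        raise TypeError("%s format is not supported." % ftype)
    convert = CONVERTERS[key]
    # Simple formats: split each line, then convert it.
    for line in handle:
        yield convert(line.rstrip("\n").split(sep))

for item in bio_reader_sketch(io.StringIO("chr1\t10\t20\n"), "interval"):
    print(item)
```

Adding a new format in this sketch means adding one entry to `CONVERTERS` or `READERS`; whether ngslib exposes its tables the same way is not shown on this page.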

Definition of the BioReader function:

def BioReader(infile,ftype=None, **kwargs):
    '''
    Read most of the iterative biological files into Python objects.
    Usage:
        for item in IO.BioReader(infile, sep="\t", skip=10, mask="#"):
            print item
    Parameters:
        ftype: default(None), Bed, Gene/GeneBed/Tab/GenePred, wig/bigwig and sam/bam. Convert line(s) into an object or a vector of strings.
        sep: character that separates the columns, default("\t").
        skip: numeric, default 0. Skip the first n lines.
        mask: character, default "#". Lines starting with this character are masked (skipped).
    Output:
        By default, it returns a vector of strings split by 'sep'.
        For any text file that has at least chrom, start and stop, 'ftype' can be given as the indices of the Bed elements, such as "2;3;1;-;4;6". "-" indicates missing values.
        Other file types are converted to Python objects such as Bed, GeneBed and SAM.
        NOTE: Set "ftype=None" for any text file that you don't want to convert to objects, such as SAM, GTF and GFF files.
    '''
    converters={ None:lambda x:x,'any':lambda x: IO.AnyToBed(x,fidx),'bed':Bed, 'gene':GeneBed, 'tab':GeneBed, 'genepred':GeneBed}
    # converters.update( {'bowtie':BowtieToBed,'soap':SOAPToBed} )
    readers={'fasta': lambda x: IO.SeqReader(x,'fasta'),'fastq':lambda x: IO.SeqReader(x,'fastq'),'wig':lambda x: IO.WigReader(x,'wig')}
    # readers.update( {'gff':IO.GFFReader,'gtf':IO.GTFReader} )
    ......

Table files

  • For simple table files, no ftype needs to be specified. BioReader just converts each line to a string vector.
  • Users can specify the sep, skip and mask values.
  1. Example: For a file whose comment lines start with "#" and whose columns are separated by "\t", skip the first 10 records:
from ngslib import IO
for line in IO.BioReader(infile, sep="\t", skip=10, mask='#'):
    print line  # replace with your own processing of the string vector
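To make the roles of sep, skip and mask concrete, here is a stand-alone sketch of the line handling. It is not ngslib's implementation: `read_table` is a hypothetical name, and the order of operations (drop masked lines first, then count `skip` over the remaining records) is an assumption, since the page does not say whether comment lines count toward `skip`.

```python
import io

def read_table(handle, sep="\t", skip=0, mask="#"):
    """Yield each unmasked line as a list of strings.

    Assumption: masked (comment) lines are dropped first, and `skip`
    then counts only the remaining records.
    """
    skipped = 0
    for line in handle:
        if mask and line.startswith(mask):
            continue  # comment line
        if skipped < skip:
            skipped += 1
            continue  # one of the first `skip` records
        yield line.rstrip("\r\n").split(sep)

demo = io.StringIO("#name\tvalue\na\t1\nb\t2\nc\t3\n")
for row in read_table(demo, skip=1):
    print(row)
```

With `skip=1`, the header comment is masked, the record `a 1` is skipped, and the two remaining rows are returned as string lists.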

Converters

  • Converters are a group of functions that accept a string vector as input and convert it into an object.
  • Simple formats such as Bed and GeneBed, which hold one object per line, can be converted line by line.
  • For less common formats such as SOAP and Bowtie, a converter is enough, so there is no need to write a complex Reader.
  • Any text file that contains at least chrom, start and end on each line can be converted to Bed format.
  1. Example: SOAP mapping results can be converted into Bed objects like this:
from ngslib import IO
for tbed in IO.BioReader(infile, ftype='soap'):
    print tbed
  1. Example: Take any text file that contains chrom, start and end information, like this:
#bin    name    chrom   strand  txStart txEnd   cdsStart        cdsEnd
0       NM_001195025    chr1    +       134212701       134230065       134212806       134228958
0       NM_028778       chr1    +       134212701       134230065       134212806       134228958
1       NM_008922       chr1    -       33510655        33726603        33510930        33725856

We can convert each line to a Bed object by giving the column indices (counted from 1) of the Bed elements in the order chrom, start, stop, name, score, strand. "-" marks a missing element, and trailing "-" entries can be omitted.
Input: without score information

from ngslib import IO
for tbed in IO.BioReader(infile, ftype='3:5:6:2:-:4', mask="#"): # no score information
    print tbed

Output: scores are set to 0.00 by default.

chr1    134212701       134230065       NM_001195025    0.00    +
chr1    134212701       134230065       NM_028778       0.00    +
chr1    33510655        33726603        NM_008922       0.00    -

Input: without score and strand information

from ngslib import IO
for tbed in IO.BioReader(infile, ftype='3:5:6:2', mask="#"): # no score and strand information
    print tbed

Output: score and strand are set to their default values (0.00 and ".").

chr1    134212701       134230065       NM_001195025    0.00    .
chr1    134212701       134230065       NM_028778       0.00    .
chr1    33510655        33726603        NM_008922       0.00    .
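The index-based conversion shown in the two examples above can be sketched as follows. This is an illustration under stated assumptions, not the library's `AnyToBed`: `parse_spec`, `any_to_bed` and `BedRow` are hypothetical names, the indices are taken to be 1-based and ":"-separated as in the examples, and the default name "NONAME" is invented for the sketch (only the 0.00 score and "." strand defaults appear in the output above).

```python
from collections import namedtuple

BedRow = namedtuple("BedRow", "chrom start stop name score strand")

# Defaults for missing elements. Score 0.0 and strand "." match the
# output shown above; the default name is an assumption.
DEFAULTS = ("", 0, 0, "NONAME", 0.0, ".")
CASTS = (str, int, int, str, float, str)

def parse_spec(spec, sep=":"):
    """Turn '3:5:6:2:-:4' into zero-based indices (None = missing)."""
    parts = spec.split(sep)
    if not 3 <= len(parts) <= len(DEFAULTS) or "-" in parts[:3]:
        raise ValueError("spec needs chrom, start and stop: %r" % spec)
    indices = [None if p == "-" else int(p) - 1 for p in parts]
    # Omitted trailing elements are treated as missing.
    return indices + [None] * (len(DEFAULTS) - len(indices))

def any_to_bed(fields, indices):
    """Build a BedRow from a split line and the parsed indices."""
    values = [DEFAULTS[i] if j is None else CASTS[i](fields[j])
              for i, j in enumerate(indices)]
    return BedRow(*values)

line = "0\tNM_001195025\tchr1\t+\t134212701\t134230065\t134212806\t134228958"
bed = any_to_bed(line.split("\t"), parse_spec("3:5:6:2:-:4"))
print("%s\t%d\t%d\t%s\t%.2f\t%s" % bed)
```

Parsing the specification once and reusing the index list for every line keeps the per-line work to a few list lookups.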

Readers

  • Readers are a group of functions that parse specific file formats.
  • Readers accept infile and ftype as input, and yield objects from the file iteratively.
  • Readers can be written in C/C++ as static libraries, as long as a Python wrapper is provided.
  1. Example: Given a FASTA file (mm9_miRNA.fa) like this:
>MIMAT0025084_1;mmu-miR-6341;MI0021869_1
AGTGCAATGATATTGTCACTAT
>MIMAT0017004_1;mmu-miR-206-5p;MI0000249_1
CATGCTTCTTTATATCCTCATA
>MIMAT0000239_1;mmu-miR-206-3p;MI0000249_1
GGAATGTAAGGAAGTGTGTGG

We will calculate the GC content of each sequence. Note that BioReader yields Fasta objects here.

from ngslib import IO,Utils
for tseq in IO.BioReader("mm9_miRNA.fa",'fasta'):
    print tseq.id+"\t"+str(round(Utils.GCContent(tseq.seq),2))

Output:

MIMAT0025084_1;mmu-miR-6341;MI0021869_1 0.32
MIMAT0017004_1;mmu-miR-206-5p;MI0000249_1       0.32
MIMAT0000239_1;mmu-miR-206-3p;MI0000249_1       0.48
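For reference, the GC fractions above can be reproduced with a few lines of plain Python. `gc_content` is a hypothetical stand-in for `Utils.GCContent`; it counts only G and C, and how the library treats ambiguity codes such as S or N is not covered here.

```python
def gc_content(seq):
    """Return the fraction of G and C bases in seq (0.0 if empty)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / float(len(seq))

for s in ("AGTGCAATGATATTGTCACTAT",
          "CATGCTTCTTTATATCCTCATA",
          "GGAATGTAAGGAAGTGTGTGG"):
    print(round(gc_content(s), 2))
```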
  1. Source code of the SeqReader function. It is a generator, so objects are yielded one at a time as the file is read.
    def SeqReader(infile,ftype='fasta'):
        '''Read sequence files.'''
        ftype=ftype.lower()
        #Read lines
        with IO.mopen(infile) as fh:
            if ftype=='fasta':
                line = fh.next()
                if line[0] != ">":
                    raise ValueError("Records in Fasta files should start with '>' character")
                line = line.lstrip('>').rstrip().replace('\t',' ').split(' ')
                name = line[0]
                desc = ' '.join(line[1:])
                seq = ''
                while True:
                    try:
                        line = fh.next()
                    except StopIteration:
                        # End of file: emit the last record, then finish the generator.
                        if seq != '':
                            yield Fasta(name, seq, desc)
                        return
                    if line[0] != ">":
                        seq += line.rstrip()
                    else:
                        yield Fasta(name, seq, desc)
                        line = line.lstrip('>').rstrip().replace('\t',' ').split(' ')
                        name = line[0]
                        desc = ' '.join(line[1:])
                        seq = ''
            elif ftype=='fastq':
                while True:
                    try:
                        fid=fh.next().rstrip().lstrip('@')
                        seq=fh.next().rstrip()
                        fh.next()  # skip the '+' separator line
                        qual = fh.next().rstrip()
                        yield Fastq(fid,seq,qual)
                    except StopIteration:
                        # End of file (or a truncated final record): finish the generator.
                        return
            else:
                raise TypeError(ftype+" format is not supported.")
            assert False, "Do not reach this line."
    SeqReader=staticmethod(SeqReader)

The user can define their own Readers in a similar way.
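As a sketch of what such a user-defined Reader might look like, the generator below parses a made-up format in which records are separated by blank lines. `block_reader` and the format itself are hypothetical, and the step of registering the function with BioReader is left out because this page does not show it.

```python
import io

def block_reader(handle):
    """Yield each blank-line-separated record as a list of lines."""
    block = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line:
            block.append(line)
        elif block:
            # A blank line closes the current record.
            yield block
            block = []
    if block:
        # The last record may not be followed by a blank line.
        yield block

demo = io.StringIO("id1\nACGT\n\nid2\nTTGA\n")
for record in block_reader(demo):
    print(record)
```

As in SeqReader, only one record is held in memory at a time, so the same pattern works for large files.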
