Wang Yunfei edited this page Feb 9, 2017 · 2 revisions

FastaFile

  • FastaFile is an alternative to TwoBitFile for reading huge genome files. It is included in the pysam package.
  • Pros: The Fasta file does not need to be converted into another format; only a simple index file created by Samtools is required.
  • Cons: Takes much more space (about 4 times the size of 2bit files). Every sequence line must have the same length except the last one. There is no chromosome size interface.
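The equal-line-length requirement comes from how a Samtools index is used: each .fai entry records where a sequence starts, how many bases sit on a line, and how many bytes a line occupies, so a base position can be converted into a file offset by arithmetic alone. The sketch below illustrates that arithmetic. The function name `byte_offset` and the numbers are made up for illustration and are not part of ngslib or pysam.

```python
# Sketch of how a Samtools .fai entry maps a base position to a byte offset.
# The helper name and the example numbers are hypothetical.

def byte_offset(pos, offset, linebases, linewidth):
    """Return the file offset of the 0-based base `pos` within one record.

    offset    -- byte offset of the record's first base
    linebases -- number of bases on each sequence line
    linewidth -- number of bytes on each sequence line, newline included
    """
    full_lines, within_line = divmod(pos, linebases)
    return offset + full_lines * linewidth + within_line

# Hypothetical record: first base at byte 5, 60 bases and 61 bytes per line.
print(byte_offset(0, 5, 60, 61))   # 5
print(byte_offset(59, 5, 60, 61))  # 64
print(byte_offset(60, 5, 60, 61))  # 66
```

If one line in the middle of a record were shorter or longer, every offset computed after it would be shifted, which is why only the last line may differ.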

Definition of FastaFile class:

class FastaFile(object):
    '''  
    Fasta file fast reader. Usually used for huge genome fasta files.
    Usage:
        Open file:
            fio=FastaFile("K12.fa")
        Get Sequence:
            fio.getSeq(chrom="K12",start=100,stop=200,strand="+")
        Close file:
            fio.close()
    Parameters:
        chrom=None: return empty string.
        start=None: start at first position
        stop=None:  stop at the end of record.
        strand: default "+"
    '''
    def __init__(self,fname=USERHOME+"/Data/hg19/hg19.fa"): ...  # 6 lines folded
    def getSeq(self,chrom,start=None,stop=None,strand="+"): ...  # 6 lines folded
    def close(self): ...  # 5 lines folded
    def __del__(self): ...  # 4 lines folded
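The docstring does not state what `strand="-"` returns. A common convention is to return the reverse complement of the requested interval, and the following is a minimal sketch under that assumption. The helper `revcomp` is hypothetical and is not taken from the ngslib source.

```python
# Hypothetical helper, not part of ngslib: reverse-complement a DNA string.
# Case is preserved so that soft-masked (lowercase) bases stay lowercase.
COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A", "N": "N",
    "a": "t", "c": "g", "g": "c", "t": "a", "n": "n",
}

def revcomp(seq):
    """Return the reverse complement; unrecognised symbols become 'N'."""
    return "".join(COMPLEMENT.get(base, "N") for base in reversed(seq))

print(revcomp("AAGct"))  # agCTT
```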

Example: get sequences from fa file.

Note: FastaFile creates an index for the fasta file (e.g. *.fa.fai) the first time the file is loaded. This step may take a while if the fasta file is huge.

from ngslib import FastaFile
fio=FastaFile("test.fa")
seq = fio.getSeq(chrom="K12",start=100,stop=200,strand="+")
print(seq)
fio.close() # the file is closed automatically if you forget to close it here.

Output:

AATATGAAGTTCTTTAGCATAACAAGGATCTGCCTTTGTAAAAGAAaaagaaagaaagagcgaaagaaagaaaAGAACTGAGGACAGCATTCTTTTCTCT
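Although FastaFile has no chromosome size interface, the .fai index stores each sequence name in its first column and the sequence length in its second, so sizes can be read from the index directly. The sketch below shows one way to do that. The function `read_chrom_sizes` and the index content are made up for illustration; a real index would be opened with something like `open("test.fa.fai")`.

```python
import io

def read_chrom_sizes(fai_handle):
    """Map sequence name -> length using the first two columns of a .fai index."""
    sizes = {}
    for line in fai_handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            sizes[fields[0]] = int(fields[1])
    return sizes

# Made-up index content: name, length, offset, bases per line, bytes per line.
example_fai = io.StringIO(u"chrA\t1000\t6\t60\t61\nchrB\t250\t1029\t60\t61\n")
sizes = read_chrom_sizes(example_fai)
print(sizes["chrA"], sizes["chrB"])
```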