Skip to content
Wang Yunfei edited this page Feb 9, 2017 · 1 revision

BedMap

  • BedMap is constructed based on the BIN system.
  • The Beds in BedMap are well organized according to their BIN values for fast overlapping search.

BIN system

  • BIN values are numbers for genome fragments. (Heng Li, Bioinformatics, 2011)
  • The genome is split into 16Kb fragments. All of them are assigned to Level 7 BIN values(4681~37449) in order.
  • 8 continuous Level N BINs are assigned to a Level N-1 value.
  • The Beds in each BIN are sorted for fast search.

BIN schematic diagram: (Note: this is a schematic diagram. Actually a Level N-1 fragment can cover 8 (2^3) Level N fragments, instead of 2 Level N fragments here.) bedmap0

# bins
_numBins   = 37450;
_binLevels = 7;

# bins range in size from 16kb to 512Mb
# Bin  0          spans 512Mbp,   # Level 1
# Bins 1-8        span 64Mbp,     # Level 2
# Bins 9-72       span 8Mbp,      # Level 3
# Bins 73-584     span 1Mbp       # Level 4
# Bins 585-4680   span 128Kbp     # Level 5
# Bins 4681-37449 span 16Kbp      # Level 6
_binOffsetsExtended = [32678+4096+512+64+8+1, 4096+512+64+8+1, 512+64+8+1, 64+8+1, 8+1, 1, 0]
_binFirstShift = 14;      # How much to shift to get to finest bin.
_binNextShift  = 3;       # How much to shift to get to next larger bin.

BIN value calculation

  • At the first time, the start and stop position is right shifted 14 bits (2^14=16K). If the Bed is located in a Level 7 fragment, the shifted start and stop will be equal.
  • If the Bed span covers multiple Level 7 fragments, the right shift (3 bits each time) will continue until the shifted start is equal to the shifted stop. The BIN Level increases every shift.
  • The BIN value is set to shifted start + BIN_Level_Offset
    def getBIN(self):
        '''Get the genome BIN.'''
        start=self.start>>_binFirstShift
        stop=(self.stop-1)>>_binFirstShift
        for i in range(_binLevels):
            if start==stop:
                return _binOffsetsExtended[i] + start
            else:
                start >>= _binNextShift
                stop  >>= _binNextShift
        print >>sys.stderr,"Error! Bed range is out of 512M."

BedMap construction

  • BedMap is a dictionary of dictionary, with the first level key as chromosome names and second level key as BIN values: BedMap[chrom][BIN]
  • If a Bed can be covered by a BIN fragment and cannot be covered by any shorter BIN fragments, the Bed will be included in that BIN group.

Data structure of BedMap: bedmap1

BedMap definition:

class BedMap(dict):
    '''
    Store Beds in dictionary with chromsomes and BIN values as keys.
    Usage:
        bedmap = BedMap() # initiate the BedMap
        bedmap.loadBedToMap("mm9_miRNA.bed",'bed') # load Beds into BedMap
        for tbed,overlen in bedmap.findOverlap(querybed):  # print all Beds overlapped with query bed.
            print tbed, overlen
    '''
   5 lines:     def __init__(self):------------------------------------------------------------------------------------------------------
  16 lines:     def findOverlap(self,bed):-----------------------------------------------------------------------------------------------
  10 lines:     def loadBedToMap(self,bedlist=None,ftype='bed'):-------------------------------------------------------------------------

Example: load mouse mm9 miRNA annotation file into BedMap

 bedmap = BedMap() # initiate BedMap with chromosome information
bedmap.loadBedToMap("mm9_miRNA.bed","bed")

Source code of loadBedToMap:

    def loadBedToMap(self,bedlist=None,ftype='bed'):
        '''Load Bed to BedMap from either Bedlist or bedfile. Load one bedlist once.'''
        if bedlist:
            if isinstance(bedlist,str):
                bedlist=IO.BioReader(bedlist,ftype)
            for bed in bedlist:
                _bin=bed.getBIN()
                self.setdefault(bed.chrom, {})
                if not self[bed.chrom].has_key(_bin):
                    self[bed.chrom][_bin] = BedList()
                    self[bed.chrom][_bin].sort()
                self[bed.chrom][_bin].insort(bed)
            self.ftype=ftype

Search Bed in BedMap

  • Given a Bed, we should check all the BIN fragments that are overlapped with this Bed, from Level 7 to Level 0.

For example, if we have a Bed with start position in BIN-4681 and stop position in BIN-4682, we should check if the Bed is overlapped with any elements in these BIN groups( Level 7:4681-4682, Level 6:585, Level 5: 73, Level 4: 9 ......)

Source code:

    def findOverlap(self,bed):
        ''' Find overlaps with bed and put into bedlist. '''
        startBin,stopBin= bed.start >> _binFirstShift, (bed.stop-1) >> _binFirstShift
        for i in range(_binLevels):
            offset = _binOffsetsExtended[i]
            for j in xrange(startBin+offset,stopBin+offset+1):
                if self[bed.chrom].has_key(j):
                    for item, overlen in self[bed.chrom].findOverlap(bed):
                        yield (item, overlen)
            startBin >>= _binNextShift
            stopBin >>= _binNextShift

Example: find overlap for a given query bed

querybed = Bed("chrX\t6813821\t6815821")
for tbed,overlen in bedmap.findOverlap(querybed):
    print str(tbed)+"\t"+str(overlen)

Output: The last field shows the overlap length

chrX	6814856	6814880	MIMAT0017258_1;mmu-miR-500-5p;MI0004702_1	0.00 	-	24
chrX	6814821	6814842	MIMAT0003507_1;mmu-miR-500-3p;MI0004702_1	0.00 	-	21
Clone this wiki locally