-
Notifications
You must be signed in to change notification settings - Fork 3
BedMap
Wang Yunfei edited this page Feb 9, 2017
·
1 revision
- BedMap is constructed based on the BIN system.
- The Beds in BedMap are well organized according to their BIN values for fast overlapping search.
- BIN values are numbers for genome fragments. (Heng Li, Bioinformatics, 2011)
- The genome is split into 16Kb fragments. All of them are assigned to Level 7 BIN values(4681~37449) in order.
- 8 continuous Level N BINs are assigned to a Level N-1 value.
- The Beds in each BIN are sorted for fast search.
BIN schematic diagram: (Note: this is a schematic diagram. Actually a Level N-1 fragment can cover 8 (2^3) Level N fragments, instead of 2 Level N fragments here.)
# bins _numBins = 37450; _binLevels = 7; # bins range in size from 16kb to 512Mb # Bin 0 spans 512Mbp, # Level 1 # Bins 1-8 span 64Mbp, # Level 2 # Bins 9-72 span 8Mbp, # Level 3 # Bins 73-584 span 1Mbp # Level 4 # Bins 585-4680 span 128Kbp # Level 5 # Bins 4681-37449 span 16Kbp # Level 6 _binOffsetsExtended = [32678+4096+512+64+8+1, 4096+512+64+8+1, 512+64+8+1, 64+8+1, 8+1, 1, 0] _binFirstShift = 14; # How much to shift to get to finest bin. _binNextShift = 3; # How much to shift to get to next larger bin.
- At the first time, the start and stop position is right shifted 14 bits (2^14=16K). If the Bed is located in a Level 7 fragment, the shifted start and stop will be equal.
- If the Bed span covers multiple Level 7 fragments, the right shift (3 bits each time) will continue until the shifted start is equal to the shifted stop. The BIN Level increases every shift.
- The BIN value is set to shifted start + BIN_Level_Offset
def getBIN(self):
'''Get the genome BIN.'''
start=self.start>>_binFirstShift
stop=(self.stop-1)>>_binFirstShift
for i in range(_binLevels):
if start==stop:
return _binOffsetsExtended[i] + start
else:
start >>= _binNextShift
stop >>= _binNextShift
print >>sys.stderr,"Error! Bed range is out of 512M."
- BedMap is a dictionary of dictionary, with the first level key as chromosome names and second level key as BIN values: BedMap[chrom][BIN]
- If a Bed can be covered by a BIN fragment and cannot be covered by any shorter BIN fragments, the Bed will be included in that BIN group.
Data structure of BedMap:
BedMap definition:
class BedMap(dict):
'''
Store Beds in dictionary with chromsomes and BIN values as keys.
Usage:
bedmap = BedMap() # initiate the BedMap
bedmap.loadBedToMap("mm9_miRNA.bed",'bed') # load Beds into BedMap
for tbed,overlen in bedmap.findOverlap(querybed): # print all Beds overlapped with query bed.
print tbed, overlen
'''
5 lines: def __init__(self):------------------------------------------------------------------------------------------------------
16 lines: def findOverlap(self,bed):-----------------------------------------------------------------------------------------------
10 lines: def loadBedToMap(self,bedlist=None,ftype='bed'):-------------------------------------------------------------------------
Example: load mouse mm9 miRNA annotation file into BedMap
bedmap = BedMap() # initiate BedMap with chromosome information
bedmap.loadBedToMap("mm9_miRNA.bed","bed")
Source code of loadBedToMap:
def loadBedToMap(self,bedlist=None,ftype='bed'):
'''Load Bed to BedMap from either Bedlist or bedfile. Load one bedlist once.'''
if bedlist:
if isinstance(bedlist,str):
bedlist=IO.BioReader(bedlist,ftype)
for bed in bedlist:
_bin=bed.getBIN()
self.setdefault(bed.chrom, {})
if not self[bed.chrom].has_key(_bin):
self[bed.chrom][_bin] = BedList()
self[bed.chrom][_bin].sort()
self[bed.chrom][_bin].insort(bed)
self.ftype=ftype
- Given a Bed, we should check all the BIN fragments that are overlapped with this Bed, from Level 7 to Level 0.
For example, if we have a Bed with start position in BIN-4681 and stop position in BIN-4682, we should check if the Bed is overlapped with any elements in these BIN groups( Level 7:4681-4682, Level 6:585, Level 5: 73, Level 4: 9 ......)
Source code:
def findOverlap(self,bed):
''' Find overlaps with bed and put into bedlist. '''
startBin,stopBin= bed.start >> _binFirstShift, (bed.stop-1) >> _binFirstShift
for i in range(_binLevels):
offset = _binOffsetsExtended[i]
for j in xrange(startBin+offset,stopBin+offset+1):
if self[bed.chrom].has_key(j):
for item, overlen in self[bed.chrom].findOverlap(bed):
yield (item, overlen)
startBin >>= _binNextShift
stopBin >>= _binNextShift
Example: find overlap for a given query bed
querybed = Bed("chrX\t6813821\t6815821")
for tbed,overlen in bedmap.findOverlap(querybed):
print str(tbed)+"\t"+str(overlen)
Output: The last field shows the overlap length
chrX 6814856 6814880 MIMAT0017258_1;mmu-miR-500-5p;MI0004702_1 0.00 - 24 chrX 6814821 6814842 MIMAT0003507_1;mmu-miR-500-3p;MI0004702_1 0.00 - 21