I am using cellSNP as a prior step to vireo and it works very well. But it is incredibly slow, and I think there's a lot of room for speeding it up. I don't program in Python and don't have much experience profiling Python code, but I think a lot of time is wasted in

https://github.com/huangyh09/cellSNP/blob/master/cellSNP/utils/pileup_utils.py#L44-L68

There's overhead to opening a BAM file, and overhead to accessing references. This is done once per SNP, so the overhead is dramatic for, say, the 1000 Genomes SNP list. A quick fix can be to change
```python
if chrom != None:
    if chrom not in samFile.references:
        if chrom.startswith("chr"):
            chrom = chrom.split("chr")[1]
        else:
            chrom = "chr" + chrom
        if chrom not in samFile.references:
            print("Can't find references %s in samFile" %chrom)
            return None, None
```
to
```python
if chrom != None:
    sam_refs = samFile.references
    if chrom not in sam_refs:
        if chrom.startswith("chr"):
            chrom = chrom.split("chr")[1]
        else:
            chrom = "chr" + chrom
        if chrom not in sam_refs:
            print("Can't find references %s in samFile" %chrom)
            return None, None
```
In a small test case I have (5000 cells, 20,000 reads, 100,000 SNPs), this change resulted in a 45% speedup. Another, more dramatic speedup is to avoid re-opening the BAM files over and over. I played around with creating a global that stores the currently open filename, chromosome and file object, then returning the cached objects if the request is identical:
```python
import sys
import pysam

# Module-level cache of the most recently opened file and chromosome
CHROM_CACHE = None
CHROM_CACHE_ID = None
CHROM_CACHE_SAMFILE = None

def check_pysam_chrom(samFile, chrom=None):
    """Check if samFile is a file name or pysam object, and if chrom format.
    """
    global CHROM_CACHE
    global CHROM_CACHE_ID
    global CHROM_CACHE_SAMFILE
    # Return the cached objects if the request matches the previous one.
    # (Note: the comparison must happen before CHROM_CACHE_SAMFILE is
    # overwritten, otherwise it is always a cache hit.)
    if CHROM_CACHE is not None:
        if (samFile == CHROM_CACHE_SAMFILE) and (chrom == CHROM_CACHE_ID):
            return CHROM_CACHE, CHROM_CACHE_ID
    CHROM_CACHE_SAMFILE = samFile  # remember the original argument
    if type(samFile) == str:
        ftype = samFile.split(".")[-1]
        if ftype != "bam" and ftype != "sam" and ftype != "cram":
            print("Error: file type need suffix of bam, sam or cram.")
            sys.exit(1)
        if ftype == "cram":
            samFile = pysam.AlignmentFile(samFile, "rc")
        elif ftype == "bam":
            samFile = pysam.AlignmentFile(samFile, "rb")
        else:
            samFile = pysam.AlignmentFile(samFile, "r")
    if chrom is not None:
        sam_refs = samFile.references
        if chrom not in sam_refs:
            if chrom.startswith("chr"):
                chrom = chrom.split("chr")[1]
            else:
                chrom = "chr" + chrom
            if chrom not in sam_refs:
                print("Can't find references %s in samFile" % chrom)
                return None, None
    CHROM_CACHE = samFile
    CHROM_CACHE_ID = chrom
    return samFile, chrom
```
This results in a 3.5x speedup under a single thread, but I don't know how the global is handled under multiple threads, so I cannot recommend this solution.
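One way to sidestep the shared-global concern would be to keep the cache in `threading.local()`, so each worker thread holds its own open file handle and no locking is needed. This is a stdlib-only sketch I have not tested against cellSNP itself; `open_file` is a counting stand-in for `pysam.AlignmentFile`, and the names are hypothetical:

```python
import threading

# Each thread gets its own cache slot: one thread can never be handed
# another thread's file handle.
_cache = threading.local()

OPEN_COUNT = {"n": 0}          # instrumentation: how many "opens" happened
_count_lock = threading.Lock()

def open_file(path):
    """Stand-in for pysam.AlignmentFile(path, 'rb') -- just counts opens."""
    with _count_lock:
        OPEN_COUNT["n"] += 1
    return ("handle", path)

def get_handle(path):
    """Return a per-thread cached handle, reopening only when path changes."""
    if getattr(_cache, "path", None) != path:
        _cache.path = path
        _cache.handle = open_file(path)
    return _cache.handle

def worker(path, n_requests, results):
    handles = {get_handle(path) for _ in range(n_requests)}
    results.append(len(handles))  # every request saw the same single handle

results = []
threads = [threading.Thread(target=worker, args=("sample.bam", 100, results))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# 4 threads x 100 requests, but only 4 opens (one per thread)
print(OPEN_COUNT["n"], results)
```

Each thread pays the open cost once instead of once per SNP, and the shared-state question disappears because nothing is shared.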
The last thing I tried was to remove the top-level thread pools and push the threading down to the lower level; however, this had a negative performance impact: for me it was slower at all thread counts higher than 1.
The suggestion on check_pysam_chrom looks very cool! I will have a test and take it. Multiple threads at the top level seems just fine at the moment, as threading at the lower level may slow down due to too many file opens and closes (my intuition), but thanks for testing it and letting me know.