
Optimisation suggestions #1

Open
Shians opened this issue Mar 3, 2019 · 1 comment

Shians commented Mar 3, 2019

I am using cellSNP as a prior step to vireo and it works very well, but it is incredibly slow and I think there's a lot of room for speeding it up. I don't program in Python and don't have much experience profiling Python code, but I believe a lot of time is wasted in

https://github.com/huangyh09/cellSNP/blob/master/cellSNP/utils/pileup_utils.py#L44-L68

There's overhead to opening a BAM file, and overhead to accessing references. This is done once per SNP, so the cost adds up dramatically for, say, the 1000 Genomes SNP list. A quick fix is to change

if chrom != None:
    if chrom not in samFile.references:
        if chrom.startswith("chr"):
            chrom = chrom.split("chr")[1]
        else:
            chrom = "chr" + chrom
    if chrom not in samFile.references:
        print("Can't find references %s in samFile" %chrom)
        return None, None

to

if chrom != None:
    sam_refs = samFile.references  # fetch the reference list once
    if chrom not in sam_refs:
        if chrom.startswith("chr"):
            chrom = chrom.split("chr")[1]
        else:
            chrom = "chr" + chrom
    if chrom not in sam_refs:
        print("Can't find references %s in samFile" %chrom)
        return None, None
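For what it's worth, the win here seems to come from two things: `references` is a property that (I assume) rebuilds its tuple on every access, and `in` on a tuple is a linear scan. A generic sketch of the pattern, with a hypothetical stand-in class instead of pysam:

```python
class FakeSam:
    """Hypothetical stand-in for pysam.AlignmentFile, for illustration only."""
    def __init__(self, names):
        self._names = names

    @property
    def references(self):
        # recomputed on every access, as a property can be
        return tuple(self._names)

sam = FakeSam(["chr%d" % i for i in range(1, 23)])

# fetch once and convert to a set: one property access, O(1) membership tests
sam_refs = set(sam.references)
queries = ["chr1", "chrX", "chr21"]
hits = [q for q in queries if q in sam_refs]
```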

In a small test case with 5,000 cells, 20,000 reads and 100,000 SNPs, this resulted in a 45% speedup. Another, more dramatic, speedup is to avoid re-opening the BAM file over and over. I played around with globals that store the currently open file name, chromosome and file object, then return the cached objects if the request is identical:

# module-level cache, shared within one process
CHROM_CACHE = None
CHROM_CACHE_ID = None
CHROM_CACHE_SAMFILE = None

def check_pysam_chrom(samFile, chrom=None):
    """Check if samFile is a file name or pysam object, and if chrom format.
    """
    global CHROM_CACHE
    global CHROM_CACHE_ID
    global CHROM_CACHE_SAMFILE

    # reuse the cached handle if the same file and chromosome are requested
    if CHROM_CACHE is not None:
        if (samFile == CHROM_CACHE_SAMFILE) and (chrom == CHROM_CACHE_ID):
            return CHROM_CACHE, CHROM_CACHE_ID

    samFile_arg = samFile  # remember the original argument as the cache key

    if type(samFile) == str:
        ftype = samFile.split(".")[-1]
        if ftype != "bam" and ftype != "sam" and ftype != "cram":
            print("Error: file type need suffix of bam, sam or cram.")
            sys.exit(1)
        if ftype == "cram":
            samFile = pysam.AlignmentFile(samFile, "rc")
        elif ftype == "bam":
            samFile = pysam.AlignmentFile(samFile, "rb")
        else:
            samFile = pysam.AlignmentFile(samFile, "r")

    if chrom is not None:
        sam_refs = samFile.references  # fetch the reference list once
        if chrom not in sam_refs:
            if chrom.startswith("chr"):
                chrom = chrom.split("chr")[1]
            else:
                chrom = "chr" + chrom
        if chrom not in sam_refs:
            print("Can't find references %s in samFile" % chrom)
            return None, None

    CHROM_CACHE = samFile
    CHROM_CACHE_ID = chrom
    CHROM_CACHE_SAMFILE = samFile_arg
    return samFile, chrom

This results in a 3.5x speedup with a single thread, but I don't know how the globals behave under multiple threads or processes, so I cannot recommend this solution outright.
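If the bare globals are the worry, one alternative (a sketch, not tested against cellSNP's multiprocessing setup) is to memoise the file-opening step with `functools.lru_cache`, which keeps one cache per process. The `open_count` counter below is only there to demonstrate that repeated calls reuse the handle:

```python
from functools import lru_cache

open_count = 0  # demo counter: how many times we actually "open" a file

@lru_cache(maxsize=8)
def open_alignment(path):
    """Hypothetical opener: in cellSNP this would return a pysam.AlignmentFile."""
    global open_count
    open_count += 1
    return {"path": path}  # placeholder for the real file handle

h1 = open_alignment("sample.bam")
h2 = open_alignment("sample.bam")  # cache hit: same object, no re-open
```

Each worker process would get its own independent cache, so there is no cross-process state to reason about; the trade-off is that handles stay open until the cache evicts them.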

The last thing I tried was to remove the top-level thread pools and push the threading down to the lower level by using

samFile = pysam.AlignmentFile(samFile, "rb", threads = nproc)

however, this had a negative performance impact: for me it was slower at every thread count above 1.
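As far as I can tell, pysam's `threads` argument only parallelises htslib's BGZF (de)compression, not the per-SNP work, which may explain this result. The top-level approach amounts to splitting the SNP list into chunks so each worker opens its own handle once; a minimal sketch of that chunking pattern (using an in-process thread pool and placeholder per-SNP work, purely for illustration):

```python
from multiprocessing.pool import ThreadPool

def pileup_chunk(chunk):
    # in the real tool each worker would open its own AlignmentFile once,
    # then pile up every SNP in its chunk through that single handle
    return [snp + 1 for snp in chunk]  # placeholder per-SNP work

snps = list(range(100))
nproc = 4
chunks = [snps[i::nproc] for i in range(nproc)]  # round-robin split
with ThreadPool(nproc) as pool:
    results = pool.map(pileup_chunk, chunks)
processed = sorted(x for chunk in results for x in chunk)
```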

@huangyh09
Collaborator

Hi Shians,

The suggestion on check_pysam_chrom looks very cool! I will test it and take it in. Keeping the multithreading at the top level seems fine for the moment, as the lower level may slow down due to too many closes and opens across threads (my intuition), but thanks for testing it and letting me know.

Cheers
Yuanhua
