
Optimisation suggestions #1

Open
Shians opened this issue Mar 3, 2019 · 1 comment

Shians commented Mar 3, 2019

I am using cellSNP as a prior step to vireo and it works very well, but it is incredibly slow and I think there's a lot of room for speeding it up. I don't program in Python and don't have much experience profiling Python code, but I believe a lot of time is wasted in

https://github.com/huangyh09/cellSNP/blob/master/cellSNP/utils/pileup_utils.py#L44-L68

There's overhead to opening a BAM file, and overhead to accessing references. This is done once per SNP, so the cost adds up dramatically for, say, the 1000 Genomes SNP list. A quick fix is to change

if chrom != None:
    if chrom not in samFile.references:
        if chrom.startswith("chr"):
            chrom = chrom.split("chr")[1]
        else:
            chrom = "chr" + chrom
    if chrom not in samFile.references:
        print("Can't find references %s in samFile" %chrom)
        return None, None

to

if chrom != None:
    sam_refs = samFile.references  # fetch the reference list once
    if chrom not in sam_refs:
        if chrom.startswith("chr"):
            chrom = chrom.split("chr")[1]
        else:
            chrom = "chr" + chrom
    if chrom not in sam_refs:
        print("Can't find references %s in samFile" %chrom)
        return None, None
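For what it's worth, the win here seems to come from two things: `references` is a property that (I assume) rebuilds its tuple on every access, and `in` on a tuple is a linear scan. A generic sketch of the pattern, with a hypothetical stand-in class instead of pysam:

```python
class FakeSam:
    """Hypothetical stand-in for pysam.AlignmentFile, for illustration only."""
    def __init__(self, names):
        self._names = names

    @property
    def references(self):
        # recomputed on every access, as a property can be
        return tuple(self._names)

sam = FakeSam(["chr%d" % i for i in range(1, 23)])

# fetch once and convert to a set: one property access, O(1) membership tests
sam_refs = set(sam.references)
queries = ["chr1", "chrX", "chr21"]
hits = [q for q in queries if q in sam_refs]
```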

In a small test case with 5,000 cells, 20,000 reads and 100,000 SNPs, this resulted in a 45% speedup. Another, more dramatic, speedup is to avoid re-opening the BAM file over and over. I played around with globals that store the currently open file name, chromosome and file object, then return the cached objects if the request is identical:

# module-level cache, shared within one process
CHROM_CACHE = None
CHROM_CACHE_ID = None
CHROM_CACHE_SAMFILE = None

def check_pysam_chrom(samFile, chrom=None):
    """Check if samFile is a file name or pysam object, and if chrom format.
    """
    global CHROM_CACHE
    global CHROM_CACHE_ID
    global CHROM_CACHE_SAMFILE

    # reuse the cached handle if the same file and chromosome are requested
    if CHROM_CACHE is not None:
        if (samFile == CHROM_CACHE_SAMFILE) and (chrom == CHROM_CACHE_ID):
            return CHROM_CACHE, CHROM_CACHE_ID

    samFile_arg = samFile  # remember the original argument as the cache key

    if type(samFile) == str:
        ftype = samFile.split(".")[-1]
        if ftype != "bam" and ftype != "sam" and ftype != "cram":
            print("Error: file type need suffix of bam, sam or cram.")
            sys.exit(1)
        if ftype == "cram":
            samFile = pysam.AlignmentFile(samFile, "rc")
        elif ftype == "bam":
            samFile = pysam.AlignmentFile(samFile, "rb")
        else:
            samFile = pysam.AlignmentFile(samFile, "r")

    if chrom is not None:
        sam_refs = samFile.references  # fetch the reference list once
        if chrom not in sam_refs:
            if chrom.startswith("chr"):
                chrom = chrom.split("chr")[1]
            else:
                chrom = "chr" + chrom
        if chrom not in sam_refs:
            print("Can't find references %s in samFile" % chrom)
            return None, None

    CHROM_CACHE = samFile
    CHROM_CACHE_ID = chrom
    CHROM_CACHE_SAMFILE = samFile_arg
    return samFile, chrom

This results in a 3.5x speedup with a single thread, but I don't know how the globals behave under multiple threads or processes, so I cannot recommend this solution outright.
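If the bare globals are the worry, one alternative (a sketch, not tested against cellSNP's multiprocessing setup) is to memoise the file-opening step with `functools.lru_cache`, which keeps one cache per process. The `open_count` counter below is only there to demonstrate that repeated calls reuse the handle:

```python
from functools import lru_cache

open_count = 0  # demo counter: how many times we actually "open" a file

@lru_cache(maxsize=8)
def open_alignment(path):
    """Hypothetical opener: in cellSNP this would return a pysam.AlignmentFile."""
    global open_count
    open_count += 1
    return {"path": path}  # placeholder for the real file handle

h1 = open_alignment("sample.bam")
h2 = open_alignment("sample.bam")  # cache hit: same object, no re-open
```

Each worker process would get its own independent cache, so there is no cross-process state to reason about; the trade-off is that handles stay open until the cache evicts them.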

The last thing I tried was to remove the top-level thread pools and push the threading down to the lower level by using

samFile = pysam.AlignmentFile(samFile, "rb", threads = nproc)

however, this had a negative performance impact: for me it was slower at every thread count above 1.
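As far as I can tell, pysam's `threads` argument only parallelises htslib's BGZF (de)compression, not the per-SNP work, which may explain this result. The top-level approach amounts to splitting the SNP list into chunks so each worker opens its own handle once; a minimal sketch of that chunking pattern (using an in-process thread pool and placeholder per-SNP work, purely for illustration):

```python
from multiprocessing.pool import ThreadPool

def pileup_chunk(chunk):
    # in the real tool each worker would open its own AlignmentFile once,
    # then pile up every SNP in its chunk through that single handle
    return [snp + 1 for snp in chunk]  # placeholder per-SNP work

snps = list(range(100))
nproc = 4
chunks = [snps[i::nproc] for i in range(nproc)]  # round-robin split
with ThreadPool(nproc) as pool:
    results = pool.map(pileup_chunk, chunks)
processed = sorted(x for chunk in results for x in chunk)
```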

@huangyh09
Collaborator

Hi Shians,

The suggestion on check_pysam_chrom looks very cool! I will test it and take it in. Keeping the multithreading at the top level seems fine for the moment, as the lower level may slow down due to too many closes and opens across threads (my intuition), but thanks for testing it and letting me know.

Cheers
Yuanhua
