
Investigate support for running SpliceAI #786

Open
lynnpais opened this issue May 14, 2024 · 1 comment
Comments

@lynnpais
Collaborator

Package is available to run with TensorFlow.

@bpblanken
Collaborator

Some early notes:

Was able to get the command line tool running:

pip install spliceai tensorflow
cat v03_pipeline/var/test/callsets/1kg_30variants.vcf | spliceai -R vep_data/hg19.fa -A grch37
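
For reference, the CLI writes its scores back into the VCF INFO field as a SpliceAI= annotation. A minimal sketch of reading those scores with pysam, assuming the documented ALLELE|SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL|DP_AG|DP_AL|DP_DG|DP_DL layout and a hypothetical output path:

import pysam

# Hypothetical path to a VCF written by the spliceai CLI above.
vcf = pysam.VariantFile("spliceai_output.vcf")
for record in vcf:
    # Each SpliceAI INFO value is pipe-delimited:
    # ALLELE|SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL|DP_AG|DP_AL|DP_DG|DP_DL
    for annotation in record.info.get("SpliceAI", []):
        fields = annotation.split("|")
        symbol = fields[1]
        delta_scores = [float(x) for x in fields[2:6]]
        print(record.chrom, record.pos, symbol, max(delta_scores))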

We have a couple of options:

  1. Try to hack spliceai into hail's VEP call setup (which runs hail table -> stdout -> command execution -> hail table). This is the least work but the most brittle.
  2. Do something similar to what we've done with the ClinGen allele registry and manage the hail export, shell exec, VCF parse, and hail import ourselves (see the sketch after this list).
    • The main concern here is that performance is quite poor: I'm seeing about 20 variants/s per worker when running locally on a VCF, compared with ~150 variants/s per worker for VEP.
  3. Read and digest the spliceai source and try to run the Keras models in batch over a variant list rather than one-by-one. Just spitballing, I'd guess this is at least 2 weeks of work for me on its own, with maybe an 80% chance of succeeding.
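
A rough sketch of option 2, assuming the variants live in a Hail Table keyed by locus/alleles and using hypothetical local paths; the real version would need to run per partition, which is where the ~20 variants/s per worker concern comes from:

import subprocess
import hail as hl

# Hypothetical input; the real pipeline would shard this per partition.
ht = hl.read_table("variants.ht")

# 1. Export the variants to a sites-only VCF for the spliceai CLI.
hl.export_vcf(ht, "spliceai_input.vcf")

# 2. Shell out to the spliceai command line tool (same flags as above).
subprocess.run(
    [
        "spliceai",
        "-I", "spliceai_input.vcf",
        "-O", "spliceai_output.vcf",
        "-R", "vep_data/hg19.fa",
        "-A", "grch37",
    ],
    check=True,
)

# 3. Re-import the annotated VCF and join the SpliceAI INFO field back on.
annotated = hl.import_vcf("spliceai_output.vcf", reference_genome="GRCh37").rows()
ht = ht.annotate(spliceai=annotated[ht.key].info.SpliceAI)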

Regardless, we should read more in depth / have a conversation with BenW about the bug fixes and changes he's made to the spliceai source on his fork.
