Replies: 5 comments 8 replies
-
@cpflueger2016 tagging you here since you've been one of the most active contributors to the issues over the years! I highly value your opinion on this!
-
Thank you for your support! I'm happily detailing what needs to be done.

General Design
The main focus of the new version should be to totally rebuild the backend using polars as much as possible. The main reasons are speed, memory usage and parallelism. Polars opens up all those aspects for almost no cost, except that everything is a dataframe now.

Parallelism & Speed
Polars provides two main objects to work with,

Memory usage
Again, Polars does a lot of the heavy lifting, as it will use whatever memory it needs to go through the processing. That being said, Polars is memory hungry. This can be mitigated by using lazy frames together with the streaming interface, which should allow CSC to run even when little memory is available. The streaming option is very easy to turn on, so it should probably be an option in the CLI.

Everything is a
-
@pjvandam88 Hello! I've made some progress on the asks and on the code side of things. I've detailed the tasks directly on the PR.
-
@pjvandam88 The branch now has a first attempt at UMI correction. I don't use the "polars" interface for it, but it runs through. I ran a quick benchmark comparing v1.4.5 and v2.0.0 on 11M reads with around 1000 cells: v2 took 1 min versus 3 min for v1.4.5, with a slight reduction in memory. I'm going to compare the outputs now to check that the counts are consistent.
-
Getting into the last cleanups and checks. Here is a UMI count comparison (figure not preserved). Looking good!
-
Ahoy everyone!
CITE-Seq-Count is now getting old and has not been actively maintained for years. This is mostly because my time has become more limited after I was done with my PhD.
I've been diving into polars recently and I'd like to use it to improve CSC!
This change would mean a huge increase in speed and better memory management, two concerns that have grown over the years as people use bigger and bigger libraries.
Using polars in the backend comes with two restrictions.
Illustration
This would no longer work, and you would need to run the mapper twice, once for each protein.
This also means that it would become a bit more difficult to troubleshoot your library.
The sliding window mode offers the possibility to map tag sequences on the R2 read when you don't have a consistent insertion position. This probably only happens with very custom libraries or bad libraries.
This means that cases like the one below would only map if you provide the starting index of the expected position of the tag on your R2 sequence.
From the issues being addressed on the repo, I doubt this feature is very popular anyways, so I'm not that worried about cutting it.
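To make the difference concrete, here is a minimal sketch of fixed-position mapping versus a sliding-window search. This is not CITE-Seq-Count's actual code: the function names `map_fixed` and `map_sliding` and the example sequences are made up for illustration, and it only handles exact matches, ignoring any mismatch tolerance.

```python
from typing import Optional


def map_fixed(read: str, tag: str, start: int) -> bool:
    """Return True if `tag` sits in `read` exactly at 0-based index `start`."""
    return read.startswith(tag, start)


def map_sliding(read: str, tag: str) -> Optional[int]:
    """Return the 0-based index of the first exact occurrence of `tag`
    anywhere in `read`, or None if it is absent (sliding-window style)."""
    idx = read.find(tag)
    return idx if idx != -1 else None


# Hypothetical tag and R2 read where the tag is shifted by two bases.
tag = "ACGTACGT"
read_shifted = "TT" + tag + "GGGG"

print(map_fixed(read_shifted, tag, 0))  # False: tag is not at the default position
print(map_fixed(read_shifted, tag, 2))  # True: works once the start index is given
print(map_sliding(read_shifted, tag))   # 2: the sliding search finds it unaided
```

Without the sliding window, a shifted read like this one only maps when the correct start index is supplied up front.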
I'm really keen on trying this out, but I need to know if that change is a deal breaker for the community.
Thank you
Patrick Rölli