CITE-Seq-Count V2 path! #182

Hoohm · 2023-08-27T10:11:09Z

Hoohm
Aug 27, 2023
Maintainer

Ahoy everyone!

CITE-Seq-Count is now getting old and has not been actively maintained for years. This is mostly because my time has become more limited after I was done with my PhD.

I've been diving into polars recently and I'd like to use it to improve CSC!

This change would mean a huge increase in speed and better memory management, two concerns that have grown over the years as people use bigger and bigger libraries.

Using polars in the backend comes with two restrictions:

The tags file cannot hold sequences of different lengths.
There would be no sliding window mode.

Illustration

A tags file like the one below would no longer work; you would need to run the mapper twice, once for each protein.

feature_name,sequence
prot1,TAGCTGATCA
prot2,TATTGC
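One way to get around the single-length restriction could be to split the tags file by sequence length before running the mapper. This is a minimal sketch, assuming the two-column CSV layout shown above; `split_tags_by_length` is a hypothetical helper, not part of CSC.

```python
import csv
import io
from collections import defaultdict

# Same content as the tags file above, inlined so the sketch is self-contained.
TAGS_CSV = """feature_name,sequence
prot1,TAGCTGATCA
prot2,TATTGC
"""


def split_tags_by_length(csv_text):
    """Group tag rows by sequence length: {length: [(name, sequence), ...]}."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[len(row["sequence"])].append((row["feature_name"], row["sequence"]))
    return dict(groups)


groups = split_tags_by_length(TAGS_CSV)
for length, tags in sorted(groups.items()):
    # Each group could be written to its own tags file and mapped separately.
    print(f"tag length {length}: {tags}")
```

Each group holds tags of a single length, so each one could be passed to its own mapper run.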

This also means that it would become a bit more difficult to troubleshoot your library.

The sliding window mode makes it possible to map tag sequences on the R2 read when you don't have a consistent insertion position. This probably happens only with very custom libraries or bad libraries.

R2_sequence: CGTAGCATCGGCTAGCTGCAGCTAGTGCTAGCC
TAG_sequence: ATCGGCTAGCTG

This means that cases like the one below would only map if you provide the starting index of the expected position of the tag on your R2 sequence.

CGTAGCATCGGCTAGCTGCAGCTAGTGCTAGCC
------ATCGGCTAGCTG---------------
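The difference between the two modes can be sketched in plain Python with the sequences from the example. `match_fixed` and `match_sliding` are hypothetical names, and the sketch uses exact matching only for brevity.

```python
R2 = "CGTAGCATCGGCTAGCTGCAGCTAGTGCTAGCC"
TAG = "ATCGGCTAGCTG"


def match_fixed(read, tag, start):
    """Fixed-position mode: compare the tag only at the given start index."""
    return read[start:start + len(tag)] == tag


def match_sliding(read, tag):
    """Sliding-window mode: search the whole read, return the first hit or -1."""
    return read.find(tag)


print(match_fixed(R2, TAG, start=0))  # tag is not at the start of the read
print(match_fixed(R2, TAG, start=6))  # tag sits after the 6 leading bases
print(match_sliding(R2, TAG))         # the position is discovered by searching
```

Without the sliding window, the offset of 6 has to be known in advance and passed in.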

From the issues being addressed on the repo, I doubt this feature is very popular anyways, so I'm not that worried about cutting it.

I'm really keen on trying this out but I need to know if that change is a deal breaker for the community.

Thank you

Patrick Rölli

What's your opinion on this big change?

I need both of these features, don't change anything

0%

I need the multi-length tags feature

0%

I need the sliding window feature

0%

I can get around them, please make it faster

100%

3 votes

Hoohm · 2023-08-27T10:16:55Z

Hoohm
Aug 27, 2023
Maintainer Author

@cpflueger2016 tagging you here since you've been one of the most active contributors to the issues over the years!

I highly value your opinion on this!

5 replies

cpflueger2016 Aug 29, 2023

I’m very honored @Hoohm ! The way I see this is that speed and scale are critical to move forward. For the rare cases (such as mine) where I needed a sliding window or different length tags, people can use the current version if need be.
I’d be really keen to see how the “polars” implementation would scale and how fast it could run. Thanks for doing this Patrick! We’re super appreciative of all your hard work and massive efforts ❤️👍💪🙌

Hoohm Oct 7, 2023
Maintainer Author

Great!

I'm moving slowly but the mapper is done!
Next up are barcode corrections and UMI corrections.

cpflueger2016 Oct 30, 2023

Let me know how you go. I have a couple of very large RNAflex libraries that I would love to parse with CiteSeqTools, especially in context of the readCounts (prior to UMI deduplication).

Hoohm Oct 31, 2023
Maintainer Author

I can probably help you there.
I've just pushed the branch.

Everything is broken, but what you need already works.

This is based on Python 3.11 and polars 0.19.12.

I can help you a bit more this weekend probably to have the output you need.

pjvandam88 Dec 18, 2023

Dear @Hoohm ,

For starters, many thanks for having developed and maintained CITE-Seq-Count!

As I am also looking into CSC to process a very large dataset, I am really looking forward to the implementation of polars. I was wondering if there is some kind of to do list of all the tasks that need to be performed on this branch. If possible and if time would permit, I would like to help out as well.

Hoohm · 2023-12-25T15:47:25Z

Hoohm
Dec 25, 2023
Maintainer Author

Thank you for your support! I'm happily detailing what needs to be done.

General Design

The main focus of the new version should be to totally rebuild the backend using polars as much as possible. The main reasons are speed, memory usage and parallelism. Polars opens all those aspects up at almost no cost, except that everything is a dataframe now.

Parallelism & Speed

Polars provides two main objects to work with, DataFrame and LazyFrame. The lazy approach allows Polars to optimize the code path, so the aim is to use lazy frames as much as possible. Today most of the code still uses DataFrame because it's more convenient to code and test, but the lazy approach should be used wherever possible.

Memory usage

Again, Polars is doing a lot of the heavy lifting, as it will use whatever memory it needs to go through the processing. That being said, Polars is memory hungry. This can be mitigated by using lazy frames together with the streaming interface, which should allow CSC to run even when memory is limited. The streaming option is very easy to turn on, so it should probably be an option in the CLI.

Everything is a `DataFrame`/`LazyFrame` (df)

The main idea behind the rework is to store everything that can be in a df. To optimize speed and memory usage, I'm aiming to keep as few dfs in memory as possible and to use joins to merge the existing dfs. Today, all of the inputs that can be are read as dfs. In terms of outputs, I want to keep the mtx files as they are and keep the optional csv output. There will be a new parquet output though.

Tasks to work on

Cell Barcode Correction

The implementation in CSC 1.4 generates a BK-tree object based on the provided whitelist. It generates all possible barcodes with edit distance one, then checks every barcode in the dataset that has not been matched to the whitelist against the extended BK-tree object. This is fairly slow, especially if we provide the full 3M or 6M as a whitelist reference.
The current polars version tries to use join_asof, using the latest version that allows a string key. The main problem today is that the polars implementation only provides two of the three strategies: forward and backward, but not nearest (which would be the easiest). I'm not confident in my understanding of join_asof, so I'm still testing to check whether I can use it in this case. I'm open to any other implementation here that uses polars dfs to find barcodes that are 1 mutation away from a reference barcode.
The current path I'm exploring is to use both strategies, one after the other, to find the actual closest barcode, because using only one returns the wrong one.
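The idea of combining both strategies can be sketched in plain Python, with `bisect` standing in for the sorted lookup that an as-of join performs. This is not the polars code from the branch: `correct_barcode` and the toy whitelist are hypothetical, and barcodes are assumed to have equal length.

```python
from bisect import bisect_left


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def correct_barcode(barcode, sorted_whitelist, max_dist=1):
    """Check the backward and forward neighbours in sort order, keep the closer one."""
    i = bisect_left(sorted_whitelist, barcode)
    candidates = []
    if i > 0:
        candidates.append(sorted_whitelist[i - 1])  # "backward" neighbour
    if i < len(sorted_whitelist):
        candidates.append(sorted_whitelist[i])      # "forward" neighbour
    best = min(candidates, key=lambda ref: hamming(barcode, ref), default=None)
    if best is not None and hamming(barcode, best) <= max_dist:
        return best
    return None


WHITELIST = sorted(["AAACCC", "AAGTTT", "CAAAAA", "CCGTAA"])

# The backward neighbour alone (AAACCC) would be wrong; the forward one is 1 away.
print(correct_barcode("AAGTTA", WHITELIST))
# A mismatch in the first base moves the barcode far away in sort order.
print(correct_barcode("CAGTTT", WHITELIST))
```

In this sketch the sorted neighbours are only a heuristic: `CAGTTT` is one mutation away from `AAGTTT`, but neither of its neighbours in sort order is, so it is left uncorrected. Whatever approach is chosen would need a test for that case.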

UMI correction

This is still the slowest and most memory-hungry part of CSC 1.4.
I'm using the umi_tools package and I don't have a polars-only solution for this part. The idea would be to use map_elements from polars to call the same function as today: umi_clusters = network.UMIClusterer(), found in processing.correct_umis_in_cells.
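To make the shape of this step concrete, here is a simplified stand-in for UMI correction on a single cell and tag. It is not the umi_tools algorithm: it is a plain greedy collapse of UMIs within Hamming distance 1, and `collapse_umis` is a hypothetical name. A function of this shape (UMI counts in, corrected counts out) is the kind of thing that could be applied per group from polars.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def collapse_umis(umi_counts, max_dist=1):
    """Greedy collapse: visit UMIs from most to least abundant and merge each
    one into the first already-kept UMI within max_dist, if there is one."""
    kept = {}
    ordered = sorted(umi_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for umi, count in ordered:
        parent = next((k for k in kept if hamming(k, umi) <= max_dist), None)
        if parent is None:
            kept[umi] = count      # new cluster seeded by this UMI
        else:
            kept[parent] += count  # treated as an error copy of the parent
    return kept


# Read counts per UMI for one cell and one tag (toy data).
raw = {"ATCG": 10, "ATCA": 1, "GGGG": 3, "GGGT": 1}
corrected = collapse_umis(raw)
print(corrected)
print(len(corrected), "UMIs after correction")
```

The pairwise comparison inside each cell is what makes this step expensive, which fits with it being the slowest part.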

Outputs

The latest branch replaced the main dict object that holds all the mapped information with a polars df. This means that all the downstream functions are broken, as they expect a dict and not a polars df. All those functions need to be made compatible with the df objects.

I hope this quick introduction helps a bit. I understand that this is not enough detail to contribute. Is there any task that calls you? I can then write some tests specific to that task to enable you to work on it.

Merry Christmas and a Happy New Year!

3 replies

pjvandam88 Dec 26, 2023

Hi Patrick,

Thanks for the introduction and the general point of view. This really helps.
A couple of things that come to mind.

I think we would need to have some kind of task list of all the processes that still need adaptation. Especially if we will start working together, it will help to see where we are. I don't know if it should be here or directly in the PR. I will let you decide.
Based on your overview, speed and parallelism are the main focus, which I like a lot. Using Rust and Polars is a great way to do this. I am completely on board. Regarding the join_asof method, I don't know whether this would be the best way since the nearest method is indeed not yet implemented. Are you aware of the PyO3 package (https://pyo3.rs/v0.20.0/)? In short, this allows running Rust code from inside Python functions. There are multiple BK-tree implementations in Rust (https://github.com/eugene-bulkin/rust-bk-tree). Maybe porting that specific piece to Rust will also speed up the implementation. We will need to benchmark though. When doing it this way, we can implement the same logic: first join using Polars, then run the BK-tree on the barcodes that have mismatches.
I agree that continuing to use umi_tools would be the best. I just had a look at the map_elements function and it does not parallelize. We could use the multiprocessing module to also speed this up with the number of cores it can use.
Regarding the outputs, it would be helpful once again to have a list of all the functions that need adaptation and are already adapted. I can also go to the code and start doing this if needed, just let me know.

Hoohm Dec 28, 2023
Maintainer Author

Hello!

I'll add more details to the to-do list of tasks that is at the beginning of the PR.
I've updated the barcode correction to use both strategies and I think it works. It should also be faster and less memory hungry than the BK-tree method. You can have a look at the latest commit; this is the function that I'm talking about: correct_barcodes_pl. I've discovered recently that polars has a Discord. Following that, I've found that folks are starting to implement polars "plugins". Here is an example of one that might be very helpful: https://github.com/ion-elgreco/polars-distance. I'm using a python function atm to check hamming distance in correct_barcodes_pl, which might be faster using the plugin. Will benchmark this.
I did use multiprocessing in the past for the umi correction step. It does go faster but memory consumption explodes. Some folks were struggling with bigger datasets. I'm also a bit worried about the behavior of multiple polars instances running at the same time on the same machine, as I don't know how it will use resources, but we could try it out.
I can do that, no problem. Will add to the to-do list on the PR. Will also define schemas for the polars tables to help out.

I should have some hours this week to get those done :)

Thank you for your input and help

Hoohm Dec 28, 2023
Maintainer Author

Forgot to mention. I'm not comfortable writing rust code and using pyo3. Not against it either but if I were to propose a rewrite of some functionality, for example the umi_tools UMI corrector, I would prefer to try and rewrite it all in polars for example. Feels like the polars ecosystem would be a perfect place to make those conversions. But if you prefer a full rust implementation, I'm not against it.

Hoohm · 2023-12-30T15:26:15Z

Hoohm
Dec 30, 2023
Maintainer Author

@pjvandam88 Hello!

I've made some progress on the tasks and on the code side of things.
You should be able to run CSC from this branch now but there is no UMI correction for now.

I've detailed tasks on the PR directly

0 replies

Hoohm · 2024-01-03T12:36:53Z

Hoohm
Jan 3, 2024
Maintainer Author

@pjvandam88 The branch now has a first attempt at UMI correction. I don't use the "polars" interface but it runs through.

I've done a quick benchmark comparing v1.4.5 vs v2.0.0 for 11M reads with around 1000 cells. It took 1 min on v2 vs 3 min on v1.4.5, with a slight reduction in memory.

Gonna compare the outputs now to see if the counts are consistent.

0 replies

Hoohm · 2024-01-04T11:28:36Z

Hoohm
Jan 4, 2024
Maintainer Author

Getting into the last cleanups and checks.
Here is a read counts comparison:

Here is a UMI counts comparison:

Looking good!
Small differences but nothing alarming.

0 replies
