Integration of the SCPCP000006 Wilms tumor dataset #857

Open
3 tasks
maud-p opened this issue Nov 5, 2024 · 7 comments

@maud-p
Contributor

maud-p commented Nov 5, 2024

If you are filing this issue based on a specific GitHub Discussion, please link to the relevant Discussion.

This issue relates to steps 6 and 7 of my proposed analysis, described in #635 (comment)

Describe the goals of the changes to the analysis module.

Step 6 – validation by integration of the 40 samples
I would like to integrate the 40 snRNA-seq samples using scVI or harmony, then perform dimensionality reduction and clustering. This will allow us to validate our annotations, as cells of the same cell type should cluster together. At the sample level, normal and cancer cells from the same histology cluster together; this might not be the case in the integrated dataset (hopefully 🤞).

Step 7 – identification of marker genes for each cell subtype using differential expression analysis
Finally, we would like to provide the WT community with specific and universal marker genes for rapid identification of the different cell types found within the tumors. To do so, we will use pseudobulk differential expression analysis (DElegate package) to find markers of the different cell types using the function FindAllMarkers2 (default parameters, patient as replicate). We would also like to further validate candidate Wilms tumor marker genes in the VISIUM data and/or in FFPE samples (IHC) and in vitro models (IF).

Additionally, we could compare relapse and non-relapse samples per cell type using the function findDE (replicate_column = "patient", method = "edger") to evaluate whether a specific phenotype within the cancer cells or the microenvironment could indicate relapse in WT.
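
To make the "pseudobulk, patient as replicate" idea concrete, here is a minimal sketch of the aggregation step only, written in plain Python. It is not how DElegate implements it; the function name `pseudobulk`, the patient IDs, cell type labels, and gene names are all hypothetical placeholders. The idea is simply that raw counts are summed over all cells sharing a (patient, cell type) pair, so each patient contributes one profile per cell type to the downstream test.

```python
from collections import defaultdict


def pseudobulk(cells):
    """Sum raw counts over all cells sharing a (patient, cell_type) pair.

    `cells` is an iterable of (patient, cell_type, counts) tuples, where
    `counts` maps a gene name to the raw count observed in one cell.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for patient, cell_type, counts in cells:
        for gene, n in counts.items():
            totals[(patient, cell_type)][gene] += n
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {key: dict(genes) for key, genes in totals.items()}


# Toy data: hypothetical patients, cell types, and genes.
cells = [
    ("P1", "blastemal", {"GENE_A": 3, "GENE_B": 0}),
    ("P1", "blastemal", {"GENE_A": 2, "GENE_B": 1}),
    ("P1", "stromal", {"GENE_A": 0, "GENE_B": 4}),
    ("P2", "blastemal", {"GENE_A": 5, "GENE_B": 0}),
]
profiles = pseudobulk(cells)
for key in sorted(profiles):
    print(key, profiles[key])
```

Each resulting profile is one replicate, which is what lets a bulk-style method treat patients, rather than individual cells, as the unit of replication.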

What will your pull request contain?

  • a script for the integration of the 40 samples
  • a notebook for the exploration of the clustering, marker genes
  • notebooks performing and exploring differential expression analyses (normal versus cancer, between histologies, relapse versus non-relapse, etc.). The idea would be one notebook per question (and PR!) 💡

Will you require additional software beyond what is already in the analysis module?

scVI integration requires a conda environment.

Will you require different computational resources beyond what the analysis module already uses?

The integration of the 40 samples will require substantial resources and might not run in cli.

If known, when do you expect to file the pull request?

I have quite a lot of wet lab work pending and I am not sure when I'll be able to focus on the follow-up analysis described here, maybe sometime in December.

But I would like to do these analyses, which will hopefully improve marker identification for cancer versus normal cells and for specific histological subtypes (epithelial, blastemal, stromal). This is crucial for our future research and would be valuable for the Wilms tumor community.

Part of the work might be out of the scope of the Open-ScPCA project, so I am happy to discuss with you if/how you would like to continue the analysis!

@maud-p maud-p added the analysis label Nov 5, 2024
@maud-p
Contributor Author

maud-p commented Jan 14, 2025

Dear DataLab team,

I am re-opening this issue as I am now working on the integration of the Wilms tumor dataset 😄

Regarding scvi-tools, do you maybe have some experience with running it in an R-based Docker container?
https://docs.scvi-tools.org/en/0.19.0/installation.html

Ideally, I would like to minimally modify the Dockerfile to allow the use of scVIIntegration with [IntegrateLayers](https://satijalab.org/seurat/reference/integratelayers), as shown in this vignette.

I saw that you used a conda environment for zellkonverter and was wondering if an adaptation of the Dockerfile as you did here at the end could work? I am not really familiar with this, so in case you have any example, it could help a lot 😄 maybe @sjspielman or @jashapiro ?

Thank you in advance!

@sjspielman
Member

Hi @maud-p! I have a few thoughts/questions for you in response -

  1. Can you give us a sense of why you are specifically interested in using scvi-tools and not other integration tools like harmony (I'll also suggest fastMNN here from the Bioconductor batchelor package, which we've had some decent experience with!)?
    1. The reason I ask is that getting set up to work with Python-based scvi-tools via R-based Seurat will probably be challenging, since there will be a lot of different environments and dependencies that have to work together. Figuring out how to make all the moving parts work together will likely take some time, and if you can achieve your results with a strictly R-based tool like harmony or fastMNN, that will be much more straightforward.
  2. Generally speaking, I want to point out that using packages/methods on their own rather than via a different package will usually give you both more flexibility and more code stability. In this specific case, I mean using integration methods directly rather than calling them from Seurat:
    1. First, harmony - I would encourage you to use the harmony package directly, for example, rather than calling it from Seurat. harmony (and fastMNN for that matter!) takes a PCA matrix as input, so these methods are flexible for many formats of input data.
    2. Second, scvi-tools - if you absolutely need to use this package, we should discuss other strategies for using it that do not require you to call it from Seurat: either directly via a Python script (which means you'd have to save objects as AnnData files to input them), or perhaps via reticulate within R. Notably, I have only ever used scvi-tools in Python directly, never via R, but I suspect that the R environment may not be straightforward to set up so that it works across computing systems.

@jashapiro
Member

@maud-p One other comment here in relation to setting up scvi-tools in the Docker image: one potential issue is that an image with both a full R install with many packages and scvi-tools with its many Python dependencies can get very large. This turns out to be a problem when trying to build the images in our automated system. Which is to say, we might have to look at how to trim down the image as part of adding scvi-tools while keeping everything working, which adds another level of challenge!

@maud-p
Contributor Author

maud-p commented Jan 16, 2025

Dear @sjspielman , dear @jashapiro ,

Thank you so much for your replies and explanations!

I don't want to add much more work and challenges on your plate, so I think I'll try to make it without scvi-tools.

The reason I wanted to try it is that I was very surprised by the integrated data using either RPCAIntegration or HarmonyIntegration in Seurat. Basically, normal and cancer epithelial (resp. stromal) cells overlap. I am not sure if this is a biological reality (which it really could be), or a problem of overintegration.

I will clean my code and open a PR related to this issue, so maybe we can improve things together.

I also came up with another idea to test (and hopefully validate) the annotation workflow: running it on the Wilms tumor 14 dataset, which contains paired tumor samples and O-PDX. Under the hypothesis that an O-PDX shouldn't contain any normal (human) cells from the patient, we could use it to check the annotated normal/cancer cells. I'll open a new issue to discuss this further!

Thank you!

@allyhawkins
Member

Hi @maud-p, I'm Ally, one of the other data scientists at ALSF. I just wanted to chime in a little since I did some work a while ago testing different integration methods.

> The reason I wanted to try it is that I was very surprised by the integrated data using either RPCAIntegration or HarmonyIntegration in Seurat. Basically, normal and cancer epithelial (resp. stromal) cells overlap. I am not sure if this is a biological reality (which it really could be), or a problem of overintegration.

What do you mean by overlap? I'm assuming this means that on the UMAP you see the cell types close to each other, but if you re-cluster the integrated results, do cells of different types fall into different clusters or into the same cluster? One good metric is cLISI, which measures how close each cell is to other cells of the same cell type in the PCA. We've used this before in our work, and here's a function we wrote a while ago that might help guide you if you want to use this metric. I'm also happy to talk more about integration metrics in general that could help identify whether you have "good" integration.
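
As a rough illustration of what a cLISI-style score captures, here is a simplified sketch in plain Python. It is an assumption-laden stand-in, not the function linked above and not the published LISI method, which weights neighbors by distance using a fixed perplexity. This version uses unweighted k nearest neighbors, and the name `knn_inverse_simpson`, the toy coordinates, and the labels are hypothetical. For each cell, it computes the inverse Simpson index of the cell type labels among its neighbors: 1 means the neighborhood contains a single cell type, and higher values mean cell types are mixed.

```python
import math
from collections import Counter


def knn_inverse_simpson(embedding, labels, k=3):
    """Per-cell inverse Simpson index of `labels` among the k nearest neighbors.

    `embedding` is a list of coordinate tuples (e.g. rows of a PCA matrix)
    and `labels` holds one cell type label per row.
    """
    scores = []
    for i, point in enumerate(embedding):
        # Brute-force distances from cell i to every other cell, O(n^2) overall.
        dists = sorted(
            (math.dist(point, other), j)
            for j, other in enumerate(embedding)
            if j != i
        )
        neighbor_labels = [labels[j] for _, j in dists[:k]]
        counts = Counter(neighbor_labels)
        simpson = sum((c / len(neighbor_labels)) ** 2 for c in counts.values())
        scores.append(1.0 / simpson)
    return scores


# Toy embedding: two well-separated groups of four cells each.
embedding = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]

# Case 1: cell types match the two groups, so neighborhoods are pure.
separated = ["epithelial"] * 4 + ["stromal"] * 4
scores_separated = knn_inverse_simpson(embedding, separated)

# Case 2: cell types alternate within each group, so neighborhoods are mixed.
mixed = ["epithelial", "stromal"] * 4
scores_mixed = knn_inverse_simpson(embedding, mixed)

print(scores_separated)
print(scores_mixed)
```

Comparing the distribution of these scores before and after integration, or between integration methods, gives a number to go alongside the visual impression from the UMAP.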

That being said, we've definitely seen that the Seurat methods can lead to overintegration, so this doesn't surprise me at all! I totally agree that scvi-tools is probably going to give you better integration results, but it is much more computationally intensive and more difficult to set up, as others have pointed out here (if you were to use it, I agree that you should use the AnnData files and set it up directly in Python). I think I would first try fastMNN, which you can also use with Seurat, before trying scvi. Another thing I might do is play around with the fastMNN and Harmony parameters, such as the integration order.

@sjspielman
Member

To jump off of what @allyhawkins wrote, I'll also point out (though this may not make a huge difference) that the harmony version that Seurat is currently using is not up to date with harmony itself, so I would consider results that directly use the harmony package to be more reliable. Also, harmony itself contains more features and options compared to what Seurat allows you to specify, so there is much more flexibility to fine-tune.

@maud-p
Contributor Author

maud-p commented Jan 16, 2025

Hi @allyhawkins , @sjspielman ,

Thank you so much, it all makes a lot of sense! I'll try fastMNN and Harmony with/without Seurat and see if I can improve things. I'll keep you updated with a PR, hopefully at the beginning of February (I'll be on vacation starting next week).

To answer your question @allyhawkins, I have clusters that mix different cell types. Before the PR, just to illustrate what it looks like with HarmonyIntegration:

[Image: UMAP of the integrated dataset after HarmonyIntegration]

Globally I'd say it is not too bad, but the annotated normal cells (kidney in green and normal stroma in turquoise) are spread all over the UMAP without forming specific clusters.
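
One simple way to put numbers on "clusters mixed with different cell types" is to cross-tabulate cluster assignments against annotations. The sketch below is a minimal, hypothetical example in plain Python; the function names, cluster IDs, and cell type labels are made up for illustration and are not taken from the module. It counts cell types per cluster and reports, for each cluster, the fraction of cells belonging to its most common cell type.

```python
from collections import Counter, defaultdict


def cluster_composition(clusters, cell_types):
    """Return {cluster: Counter of cell types} for paired per-cell labels."""
    table = defaultdict(Counter)
    for cluster, cell_type in zip(clusters, cell_types):
        table[cluster][cell_type] += 1
    return dict(table)


def cluster_purity(table):
    """Fraction of each cluster taken up by its most common cell type."""
    return {
        cluster: counts.most_common(1)[0][1] / sum(counts.values())
        for cluster, counts in table.items()
    }


# Toy labels: cluster 0 is mostly one cell type, cluster 1 is an even mix.
clusters = [0, 0, 0, 0, 1, 1, 1, 1]
cell_types = [
    "cancer_epithelial", "cancer_epithelial", "cancer_epithelial", "kidney",
    "cancer_stroma", "normal_stroma", "cancer_stroma", "normal_stroma",
]

table = cluster_composition(clusters, cell_types)
purity = cluster_purity(table)
print(table)
print(purity)
```

Swapping the arguments, as in `cluster_composition(cell_types, clusters)`, gives the opposite view: how each annotated cell type is distributed across clusters, which is the pattern described for the normal cells.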
