Improve Wilms Tumor Dataset Annotation (SCPCP000006) - explore `predicted.score` and `has_cnv.score` thresholds #856

maud-p · 2024-11-05T21:32:05Z

If you are filing this issue based on a specific GitHub Discussion, please link to the relevant Discussion.

This issue follows the PR #844 and the 2 comments:

Describe the goals of the changes to the analysis module.

I would like to explore different thresholds for filtering and annotating based on the predicted.score and cnv.score.
I would like to:

- improve the UMAP reduction visualization with a 2-color plot showing only one annotation, with the rest in grey.
- look at the distribution of predicted.score for each predicted.compartment and predicted.cell_type. So far, we have only used the predicted.score to select normal cells (i.e. endothelial and immune cells), but have not used it to filter out cells with very low-confidence annotations (to be labeled as unknown).
- render a few notebooks with a cnv_threshold of 0, 1 or 2 and evaluate the identification of normal cells. I'd like to check the distribution of the predicted.score of endothelial, immune, normal kidney and normal stroma cells using each of the thresholds. It may be that, due to false-positive CNV calls, some normal cells show inferred CNVs. If this is the case, we should expect to recover more normal cells with high predicted.score when using a higher cnv_threshold.
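
The last item above could be sketched as a small shell loop that renders the 07 notebook once per threshold. This is only a sketch under stated assumptions: the `cnv_threshold` notebook parameter, the `notebook/` path, and the output file names are hypothetical and not confirmed parts of the module. Only `rmarkdown::render()` and its `params` and `output_file` arguments are existing R functionality. By default the loop prints the commands rather than running them.

```shell
#!/usr/bin/env bash
set -eu

# Assumed location of the notebook; adjust to the module's actual layout.
notebook="notebook/07_combined_annotation_across_samples_exploration.Rmd"

# Build the R expression that renders the notebook for one threshold.
# `cnv_threshold` is a hypothetical notebook parameter name.
render_expr() {
  value="$1"
  printf 'rmarkdown::render("%s", params = list(cnv_threshold = %s), output_file = "07_cnv_threshold_%s.html")' \
    "$notebook" "$value" "$value"
}

# Print the commands by default; set DRY_RUN=0 to actually render.
for threshold in 0 1 2; do
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "Rscript -e '$(render_expr "$threshold")'"
  else
    Rscript -e "$(render_expr "$threshold")"
  fi
done
```

Writing one HTML file per threshold keeps the three renders side by side, so the predicted.score distributions of the normal cell populations can be compared across thresholds.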

What will your pull request contain?

A few changes in the 07 notebook

Will you require additional software beyond what is already in the analysis module?

No response

Will you require different computational resources beyond what the analysis module already uses?

No response

If known, when do you expect to file the pull request?

~ November


sjspielman · 2024-11-05T22:04:41Z

Hi @maud-p, glad to see you back here in issues! I wanted to give you a heads up about continuing this module - I am still working behind the scenes on your module to get it all running in CI. I have updated the label transfer code but it's not yet merged into main (but will be within the next 2 weeks I think 🤞), since I am still working in a separate branch to fix some bugs we are now able to find with all code running in CI. You can see code as we work on it in this branch: https://github.com/AlexsLemonade/OpenScPCA-analysis/tree/feature/wilms-tumor-06-azimuth. While I am still working in my fork, rather than sending PRs to main, I am sending them here. Once this is entirely finished, we'll merge that branch into main.

FYI - one silly (!!!) bug I found is that somehow we never actually applied the score threshold in inferCNV - woops!! So as part of this, I am making sure we use the threshold in that script too!

I think that working on the module while I am still doing this will result in _a lot_ of conflicts which will be very challenging to resolve. Also, the results will slightly change because of the new label transfer code, and the actual use of the 0.85 threshold in inferCNV, which will also complicate interpretation and validation. Are you able to wait a few weeks before doing these additional analyses? I will certainly keep you updated as I continue this process!

maud-p · 2024-11-05T22:21:27Z

@sjspielman thank you for all your efforts in making the analysis run in CI! I understand and I can wait, no problem at all!
No rush from my side. I just opened the issue to inform you about the plans and coordinate the next steps with you.
Just let me know if/how I can help and when I can start working on the analysis again 😃
Thank you !

sjspielman · 2024-12-03T18:41:50Z

Hi @maud-p !

We're all done working to update your module to ensure it runs smoothly through CI. We've made a decent number of changes to the module workflow. Here are the most important ones to be aware of:

- The workflow is now run with a shell (not R) script, 00_run_workflow.sh. This is because, it turns out, when running system() from an R script, the script will not fail when an individual step fails. This means errors don't get caught. So, we switched to a shell script.
  1. When adding new steps to be run in CI, they should go in here. If you are newer to working with Bash vs R, we're happy to help get lines in there when the time comes.
  2. This script no longer runs explore-cnv-methods.sh (also converted from an R script to shell script) since these steps were exploratory, do not contribute to the final annotations, and take a very long time to re-run.
- Some notebook reorganization: The copykat & infercnv exploration notebooks as well as the notebook to characterize the Stewart reference are now stored in supplemental_notebooks. This directory can be used for notebooks that are not necessarily part of the analysis pipeline, but we run for exploratory reasons. This directory also has a notebook I wrote to ensure the new label transfer results are consistent with previous results obtained directly from Azimuth.
- We have re-generated all notebooks with the most recent data release (2024-11-25). But, because I don't have permission to upload results to your bucket to share with you, you will want to re-run the workflow yourself to generate updated result files locally. You can do this with bash 00_run_workflow.sh. Please make sure to download the most recent data release before you do this.
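
The failure-handling point in the first item above can be illustrated with a small sketch. It is not the contents of 00_run_workflow.sh: the step names are hypothetical, and `false` stands in for a step that fails. The assumption illustrated here is that the script uses bash's `set -euo pipefail`, under which bash stops at the first failing command and returns a non-zero status that CI can catch.

```shell
#!/usr/bin/env bash

# A hypothetical workflow run in a child bash process.
# `false` simulates a step that fails partway through.
workflow='
set -euo pipefail
echo "step 1: label transfer"
false
echo "step 2: inferCNV"
'

# Capture what the workflow printed and the status it exited with.
status=0
output="$(bash -c "$workflow")" || status=$?

echo "$output"
echo "workflow exit status: $status"
```

Step 2 never runs, and the child process exits with a non-zero status instead of carrying on silently.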

It's worth noting a couple changes to how cell typing is currently performed, since this may be something you want to look into more: Several of the samples do not have a reliable set of normal cells to use as a reference for inferCNV, so these are run with no reference. You can see this here:

OpenScPCA-analysis/analyses/cell-type-wilms-tumor-06/00_run_workflow.sh

Lines 139 to 150 in 0e826e7

    
    # These samples do not have sufficient normal cells to run with a reference in infercnv
    samples_no_reference=("SCPCS000177" "SCPCS000180" "SCPCS000181" "SCPCS000190" "SCPCS000197")

    # Define inferCNV reference set
    if [[ " ${samples_no_reference[*]} " =~ " ${sample_id} " ]]; then
      reference="none"
    else
      reference="both"
    fi

    # Run inferCNV
    Rscript scripts/06_infercnv.R --sample_id $sample_id --reference $reference --HMM i3 ${test_string}

This approach allows code to run in CI, but you may have a different idea for how you'd prefer to treat these samples, so I wanted to point this out in particular!

Before you return to analysis (if you're still interested!), I recommend you take a little time to look over how the module now looks, and let us know what questions you have! Let us know if we can help you sync back up too, in case of conflicts when you pull into your fork. Also, FYI - we will be teaching a workshop the week of December 9th, so we may be slower to respond during those few days.

maud-p · 2024-12-03T20:55:29Z

Hi @sjspielman ,

Thank you so much! From a first rapid look, this all seems like great changes and really well organized, so thank you so much!

I am still willing to continue working on the dataset! I'll try to play with the threshold parameters in the last step, 07_combined_annotation_across_samples_exploration.Rmd, and use the different annotations to find marker genes for each of the 2nd-level annotations. FYI, ideally, I'd then like to validate them in patient samples with IF or IHC staining.

I'll add the differential expression analysis to find candidate marker genes to the module, if you think it would be a good add-on, and then we can discuss further how you'd like to follow the IF/IHC validation (if you'd like to!).

Time-wise, I plan to focus on that analysis starting from January 2025, I am afraid I won't make it before Christmas.

Thank you again, looking forward to the next steps!

sjspielman · 2024-12-03T20:59:20Z

> Time-wise, I plan to focus on that analysis starting from January 2025, I am afraid I won't make it before Christmas.

Absolutely, this analysis module is all yours! It's here to work on whenever you are able; no rush on our end :) Enjoy your holiday season!!

maud-p added the analysis label Nov 5, 2024

maud-p mentioned this issue Jan 17, 2025

improve_07_annotation #994

Open

8 tasks
