Should we reprocess all profiles before frozen data release? #62
Comments
@gwaygenomics I've updated the time estimate section in your top post. I'm not sure how long 2. will take, but if it's not too long, I propose we do 1 and 2, but not 3 (unless you think it's feasible for you to do it, given everything else going on) |
Sounds good. dvc will not take long (a couple of hours). I will use #63 to track 1 (I hope to get this running tomorrow) and will open a new PR for 2. |
If possible, it will be super helpful if you can add some notes for migrating/setting up |
in #61 (comment), I said:
Facing this now. @shntnu, do you have any historical knowledge about how these broad ids might have differed from the pilot? In n=1, one plate from batch 2 has only 13 MOAs matched in |
@tnat1031 do you happen to know the answer to this? The details below might help recap.
library(tidyverse)
platemaps <-
  c("https://raw.githubusercontent.com/broadinstitute/lincs-cell-painting/master/metadata/platemaps/2017_12_05_Batch2/platemap/ASG003_A549_24H.txt",
    "https://raw.githubusercontent.com/broadinstitute/lincs-cell-painting/master/metadata/platemaps/2017_12_05_Batch2/platemap/LKCP001_A549_24H.txt",
    "https://raw.githubusercontent.com/broadinstitute/lincs-cell-painting/master/metadata/platemaps/2017_12_05_Batch2/platemap/LKCP002_A549_24H.txt")
n_cell_lines <- 3
n_time_points <- 3
lkcp_broad_samples <-
  platemaps %>%
  map_df(read_tsv, col_types = cols()) %>%
  distinct(broad_sample)
lkcp_broad_samples %>%
  sample_n(10) %>%
  knitr::kable()
lkcp_broad_samples %>%
  count() %>%
  knitr::kable()
Created on 2021-04-02 by the reprex package (v0.3.0) |
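To quantify how many of these broad IDs align with an annotation table, something like the following might help. It is only a sketch in plain Python: the IDs are made up, `compound_id` and `match_rate` are hypothetical helpers, and it assumes that the first 13 characters of a `broad_sample` value (e.g. `BRD-K00000001`) identify the compound while the suffix identifies the batch. If that assumption does not hold for the annotation file in use, the truncated match count is not meaningful.

```python
def compound_id(broad_sample: str) -> str:
    """Return the leading 13 characters (assumed to be the compound-level ID)."""
    return broad_sample.strip()[:13]


def match_rate(broad_samples, annotated_ids):
    """Count platemap IDs that match annotations exactly and after truncation."""
    # Drop empty entries (e.g. wells without a compound) and de-duplicate.
    samples = {s.strip() for s in broad_samples if s and s.strip()}
    annotated_full = {a.strip() for a in annotated_ids}
    annotated_short = {compound_id(a) for a in annotated_ids}

    exact = {s for s in samples if s in annotated_full}
    truncated = {s for s in samples if compound_id(s) in annotated_short}
    return {
        "n_samples": len(samples),
        "exact": len(exact),
        "truncated": len(truncated),
        "unmatched": sorted(samples - truncated),
    }


# Made-up IDs for illustration only; real values would come from the platemaps.
platemap_ids = [
    "BRD-K00000001-001-01-1",
    "BRD-K00000002-003-02-5",
    "BRD-A00000003-001-01-9",
]
annotation_ids = ["BRD-K00000001-001-01-1", "BRD-K00000002-001-01-1"]

report = match_rate(platemap_ids, annotation_ids)
print(report)
```

A large gap between the exact and truncated counts would suggest the mismatch comes from batch suffixes and not from missing annotations.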
@shntnu @gwaygenomics If I recall correctly I think the batch 2 compounds (aka LKCP) were not explicitly chosen to overlap with the pilot compounds. Rather, it was an experiment designed to compare the L1000 and CP readouts with exactly the same conditions (compounds, cell lines, doses, replicates, time points exactly matched). |
thanks @tnat1031 - the specific question is if you know why the majority of these compounds do not align with CMAP broad ID annotations. Were they experimental compounds lacking MOA/target info? |
looks like one of the tables linked in that doc indicates that many of these broad IDs do indeed have MOA annotations. Two comments:
|
I think one possible issue could be that the 'official' CMap MoA/target annotations from batch 1 were incomplete. These annotations were (and still are) pretty consistently in flux, and it's possible the annotations in the google spreadsheet do not match those in the CMap file you've been using. They should all be annotated though, as none of them are experimental compounds. Are the annotations very different or are they different terms (or spellings) that have similar meaning? One solution I can think of is to just use whatever MoA/target annotations are currently provided in the repurposing hub as a reputable 3rd party source for this information, then freeze it with the data. I realize this might impact Adeniyi's MoA classification results. Is re-training and re-testing those classifiers prohibitive? |
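One way to check whether two annotation sources really differ, or only differ in spelling, is to normalize the MOA strings before comparing them. The sketch below is illustrative only: the compound IDs and MOA terms are made up, `normalize_moa` and `compare_annotations` are hypothetical helpers, and the assumption that multiple MOAs are separated by `|`, `,` or `;` may not match the actual files.

```python
import re


def normalize_moa(term: str) -> frozenset:
    """Lower-case, split multi-MOA strings, and collapse punctuation to spaces."""
    parts = re.split(r"[|,;]", term.lower())
    cleaned = (re.sub(r"[^a-z0-9]+", " ", part).strip() for part in parts)
    return frozenset(part for part in cleaned if part)


def compare_annotations(source_a: dict, source_b: dict) -> dict:
    """Bucket each compound in source_a by how its MOA compares with source_b."""
    result = {"same": [], "spelling_only": [], "different": [], "missing": []}
    for cid, moa_a in source_a.items():
        moa_b = source_b.get(cid)
        if moa_b is None:
            result["missing"].append(cid)
        elif moa_a == moa_b:
            result["same"].append(cid)
        elif normalize_moa(moa_a) == normalize_moa(moa_b):
            result["spelling_only"].append(cid)
        else:
            result["different"].append(cid)
    return result


# Made-up annotations for illustration only.
spreadsheet = {
    "BRD-K00000001": "HDAC inhibitor",
    "BRD-K00000002": "CDK inhibitor|GSK inhibitor",
    "BRD-K00000003": "tubulin inhibitor",
    "BRD-K00000004": "EGFR inhibitor",
}
cmap_file = {
    "BRD-K00000001": "HDAC inhibitor",
    "BRD-K00000002": "GSK-inhibitor, cdk inhibitor",
    "BRD-K00000003": "microtubule inhibitor",
}

buckets = compare_annotations(spreadsheet, cmap_file)
print(buckets)
```

Terms that mean the same thing but use different words (as in the third made-up compound) still land in the "different" bucket, so that bucket would need manual review.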
I see. This aligns with our experience. We're actually using a maximally aligned MOA file using all previous, publicly available CMAP annotation resources. In my opinion, all of these fixes should happen upstream of this repo, so I agree with this plan:
This will not actually impact @AdeboyeML's MOA classification work, since we're already using the maximally aligned annotations. We will, however, need to rerun anyway after the data freeze and with spherized (aka whitened) data. In an attempt to solve these problems upstream, I'll tag @jrsacher. Josh has helped us a ton in getting the best possible alignment of CMAP MOA/Target annotations. Josh, I see that you're no longer at the Broad. If you don't mind, can you connect us with the cheminformatics data scientist who would be most able to help us resolve these issues? Thanks! |
Chuck Perry ([email protected]) has taken over Repurposing from a chemistry perspective. As far as I'm aware, there isn't anyone in a pure cheminformatics role anymore, but he may be able to help with the annotation data. |
Ok cool, that sounds good to me. Thanks everyone. |
Hi @jrsacher - we are wrapping up this paper now, and we'd like to include you in our acknowledgements section. We will write something to the effect of "We'd like to thank Joshua Sacher for his help in curating Drug Repurposing Hub compound metadata." Do we have your permission to include you in this section? Thanks again for all of your expertise with this effort! |
Absolutely! I appreciate the appreciation! |
Will do! Thanks again! |
I am leaning towards doing this. To work toward reprocessing, we need to accomplish the following:
release pycytominer version 0.1. It will be great to include a stable pycytominer version in the conda environment. We've upgraded pycytominer so much since the original reprocessing, and rerunning profiles will ease headaches (see below).
(Decided not to pursue)
What headaches will an updated pycytominer resolve?
epsilon in spherize()
Rerunning the pipeline will also enable us to migrate from git lfs to dvc.
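For context on the epsilon item: in whitening (spherizing) in general, each principal direction is rescaled by roughly 1 / sqrt(eigenvalue + epsilon), so epsilon controls how strongly near-zero-variance directions get amplified. The sketch below illustrates only that general idea with made-up eigenvalues. `whitening_scale` is a hypothetical helper, and this is not a description of pycytominer's implementation or its default value.

```python
import math


def whitening_scale(eigenvalue: float, epsilon: float) -> float:
    """Scale factor applied to one principal direction during whitening."""
    return 1.0 / math.sqrt(eigenvalue + epsilon)


# Made-up covariance eigenvalues; the last one is nearly degenerate.
eigenvalues = [4.0, 1.0, 1e-12]

for epsilon in (0.0, 1e-6):
    scales = [whitening_scale(ev, epsilon) for ev in eigenvalues]
    print(f"epsilon={epsilon:g}: " + ", ".join(f"{s:.4g}" for s in scales))
```

With epsilon set to zero the near-degenerate direction is amplified about a million-fold, while a small positive epsilon caps that amplification and leaves the well-conditioned directions almost unchanged. Profiles spherized with different epsilon values are therefore not directly comparable.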
Time estimate