Should we reprocess all profiles before frozen data release? #62

Closed
2 tasks done
gwaybio opened this issue Mar 22, 2021 · 17 comments

Comments

@gwaybio
Member

gwaybio commented Mar 22, 2021

I am leaning towards doing this. To work toward reprocessing, we need to accomplish the following:

What headaches will an updated pycytominer resolve?

Rerunning the pipeline will also enable us to migrate from git lfs to dvc.

Time estimate

  1. Rerunning the pipeline will take a non-negligible amount of time, probably ~1 week, but it will increase our confidence in the data and improve its organization.
  2. Migrating from git lfs to dvc will take 4 hours.
  3. Releasing pycytominer version 0.1 will take longer. I think we are close to an official version 0.1 release: https://github.com/cytomining/pycytominer/milestone/1
@shntnu
Collaborator

shntnu commented Mar 22, 2021

@gwaygenomics I've updated the time estimate section in your top post. I'm not sure how long 2 will take, but if it's not too long, I propose we do 1 and 2 but not 3 (unless you think it's feasible for you to do it, given everything else going on).

@gwaybio gwaybio mentioned this issue Mar 30, 2021
5 tasks
@gwaybio
Member Author

gwaybio commented Mar 30, 2021

Sounds good. dvc will not take long (a couple of hours). I will use #63 to track 1 (I hope to get this running tomorrow) and will open a new PR for 2.

@shntnu
Collaborator

shntnu commented Mar 30, 2021

> a new PR for 2

If possible, it would be super helpful if you could add some notes on migrating to/setting up dvc to this issue:
cytomining/profiling-template#13
(rough notes are perfectly fine, especially given your time constraints).

@gwaybio
Member Author

gwaybio commented Apr 1, 2021

In #61 (comment), I said:

> But i wonder if I need to update the external moa file first with the new batch broad ids...

Facing this now. @shntnu, do you have any historical knowledge about how these broad ids might have differed from the pilot?

So far I have checked a single plate (n=1): that plate, from batch 2, has only 13 MOAs matched in repurposing_info_external_moa_map_resolved.tsv, while batch 1 plates have ~60.
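
For anyone who wants to reproduce this kind of check, here is a minimal sketch of counting how many distinct compounds on a plate have an MOA annotation. It is not the code used in this repo. The `broad_sample` column name comes from the platemaps; the `broad_id` and `moa` column names for the MOA map, the matching on the 13-character `BRD-` prefix, and all of the toy rows (including the placeholder MOA labels and `BRD-K00000000`) are assumptions made for illustration.

```python
import csv
import io
from typing import Tuple


def to_broad_id(broad_sample: str) -> str:
    """Truncate a full sample ID (e.g. BRD-K41599323-001-01-5) to its
    13-character compound prefix (BRD-K41599323). Assumed convention."""
    return broad_sample[:13]


def count_moa_matches(platemap_tsv: str, moa_map_tsv: str) -> Tuple[int, int]:
    """Return (matched, total) for the distinct compounds on one plate."""
    # Compound IDs that carry a non-empty MOA annotation in the map.
    moa_ids = {
        row["broad_id"]
        for row in csv.DictReader(io.StringIO(moa_map_tsv), delimiter="\t")
        if row["moa"].strip()
    }
    # Distinct compounds on the plate; non-BRD entries (e.g. DMSO) are skipped.
    plate_ids = {
        to_broad_id(row["broad_sample"])
        for row in csv.DictReader(io.StringIO(platemap_tsv), delimiter="\t")
        if row["broad_sample"].startswith("BRD-")
    }
    return len(plate_ids & moa_ids), len(plate_ids)


# Toy inputs (hypothetical rows, not real annotations).
PLATEMAP = (
    "well_position\tbroad_sample\n"
    "A01\tDMSO\n"
    "A02\tBRD-K41599323-001-01-5\n"
    "A03\tBRD-K59325863-001-03-6\n"
    "A04\tBRD-A87130939-001-07-9\n"
    "A05\tBRD-K41599323-001-01-5\n"
)
MOA_MAP = (
    "broad_id\tmoa\n"
    "BRD-K41599323\tplaceholder_moa_a\n"
    "BRD-A87130939\tplaceholder_moa_b\n"
    "BRD-K00000000\tplaceholder_moa_c\n"
    "BRD-K59325863\t\n"
)

matched, total = count_moa_matches(PLATEMAP, MOA_MAP)
print(f"{matched} of {total} compounds on the plate have an MOA annotation")
```

Running the same count per plate for batch 1 and batch 2 would show whether the gap comes from IDs missing from the map entirely or from IDs that are present but have an empty MOA field.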

@shntnu
Collaborator

shntnu commented Apr 2, 2021

> do you have any historical knowledge about how these broad ids might have differed from the pilot?

@tnat1031 do you happen to know the answer to this? The details below might help recap.

# Load tidyverse for read_tsv(), map_df(), and the dplyr verbs used below
library(tidyverse)

# Three Batch 2 platemap files from the lincs-cell-painting repo
platemaps <- 
  c("https://raw.githubusercontent.com/broadinstitute/lincs-cell-painting/master/metadata/platemaps/2017_12_05_Batch2/platemap/ASG003_A549_24H.txt",
    "https://raw.githubusercontent.com/broadinstitute/lincs-cell-painting/master/metadata/platemaps/2017_12_05_Batch2/platemap/LKCP001_A549_24H.txt",
    "https://raw.githubusercontent.com/broadinstitute/lincs-cell-painting/master/metadata/platemaps/2017_12_05_Batch2/platemap/LKCP002_A549_24H.txt")

# Design constants; not used in the summary below
n_cell_lines <- 3
n_time_points <- 3

# Read all platemaps into one data frame and keep the unique broad_sample IDs
lkcp_broad_samples <- 
  platemaps %>%
  map_df(read_tsv, col_types = cols()) %>% 
  distinct(broad_sample)

# Show 10 randomly sampled IDs
lkcp_broad_samples %>% 
  sample_n(10) %>%
  knitr::kable()

| broad_sample           |
|:-----------------------|
| BRD-K41599323-001-01-5 |
| BRD-K59325863-001-03-6 |
| BRD-K19034817-001-04-8 |
| BRD-K92723993-001-12-5 |
| BRD-K70301876-034-06-1 |
| BRD-K57252450-001-02-5 |
| BRD-A87130939-001-07-9 |
| BRD-K12906202-001-06-2 |
| BRD-K15567136-003-03-3 |
| BRD-A78195072-001-06-2 |

lkcp_broad_samples %>%
  count() %>%
  knitr::kable()

|   n |
|----:|
| 349 |

Created on 2021-04-02 by the reprex package (v0.3.0)

@tnat1031
Contributor

tnat1031 commented Apr 2, 2021

@shntnu @gwaygenomics If I recall correctly, the batch 2 compounds (aka LKCP) were not explicitly chosen to overlap with the pilot compounds. Rather, it was an experiment designed to compare the L1000 and CP readouts under exactly the same conditions (compounds, cell lines, doses, replicates, and time points exactly matched).

@gwaybio
Member Author

gwaybio commented Apr 2, 2021

Thanks @tnat1031 - the specific question is if you know why the majority of these compounds do not align with CMAP broad ID annotations. Were they experimental compounds lacking MOA/target info?

@shntnu
Collaborator

shntnu commented Apr 2, 2021

Thanks for looking into this @tnat1031

> the specific question is if you know why the majority of these compounds do not align with CMAP broad ID annotations.

Exactly

> Were they experimental compounds lacking MOA/target info?

@tnat1031 perhaps this doc might help you recollect?

@gwaybio
Member Author

gwaybio commented Apr 2, 2021

It looks like one of the tables linked in that doc indicates that many of these broad IDs do indeed have MOA annotations.

Two comments:

  • I don't see TARGET info.
  • It's possible that we already have all of the annotations present in that document; it does seem like there might be fewer than in batch 1 (I will check).

@tnat1031
Contributor

tnat1031 commented Apr 2, 2021

I think one possible issue could be that the 'official' CMap MoA/target annotations from batch 1 were incomplete. These annotations were (and still are) pretty consistently in flux, and it's possible the annotations in the google spreadsheet do not match those in the CMap file you've been using. They should all be annotated though, as none of them are experimental compounds. Are the annotations very different or are they different terms (or spellings) that have similar meaning?

One solution I can think of is to just use whatever MoA/target annotations are currently provided in the repurposing hub as a reputable 3rd party source for this information, then freeze it with the data. I realize this might impact Adeniyi's MoA classification results. Is re-training and re-testing those classifiers prohibitive?
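
One way to answer the "very different, or just different spellings?" question is to compare the two annotation strings before and after a light normalization. The sketch below is only an illustration under stated assumptions: the `|` delimiter for multi-MOA entries, the normalization rules (case, whitespace, hyphens, underscores), and the function names are all hypothetical and are not taken from the CMap files or from this repo.

```python
import re


def normalize_moa(annotation: str) -> frozenset:
    """Lowercase each '|'-separated term and collapse hyphens, underscores,
    and whitespace runs into single spaces. Term order is ignored."""
    terms = []
    for term in annotation.split("|"):
        cleaned = re.sub(r"[-_\s]+", " ", term.strip().lower())
        if cleaned:
            terms.append(cleaned)
    return frozenset(terms)


def compare_annotations(old: str, new: str) -> str:
    """Classify a pair of MOA strings from two annotation sources."""
    if old == new:
        return "identical"
    if normalize_moa(old) == normalize_moa(new):
        return "same after normalization"
    return "different"


# Illustrative strings only; not copied from either annotation file.
for old, new in [
    ("HDAC inhibitor", "HDAC inhibitor"),
    ("HDAC inhibitor|CDK inhibitor", "cdk inhibitor | hdac-inhibitor"),
    ("HDAC inhibitor", "CDK inhibitor"),
]:
    print(f"{old!r} vs {new!r}: {compare_annotations(old, new)}")
```

Pairs that land in the "different" bucket are the ones that would need manual review; the other two buckets only reflect formatting drift between sources.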

@gwaybio
Member Author

gwaybio commented Apr 2, 2021

> These annotations were (and still are) pretty consistently in flux, and it's possible the annotations in the google spreadsheet do not match those in the CMap file you've been using.

I see. This aligns with our experience. We're actually using a maximally aligned MOA file built from all previous, publicly available CMAP annotation resources. In my opinion, all of these fixes should happen upstream of this repo, so I agree with this plan:

> One solution I can think of is to just use whatever MoA/target annotations are currently provided in the repurposing hub as a reputable 3rd party source for this information, then freeze it with the data.

This will not actually impact @AdeboyeML's MOA classification work, since we're already using the maximally aligned annotations. We will, however, need to rerun anyway after data freeze and with spherized (aka whitened) data.


In an attempt to solve these problems upstream, I'll tag @jrsacher. Josh has helped us a ton in getting the best possible alignment of CMAP MOA/Target annotations. Josh, I see that you're no longer at the Broad. If you don't mind, can you connect us with the cheminformatics data scientist who would be most able to help us resolve these issues?

Thanks!

@jrsacher

jrsacher commented Apr 2, 2021

Chuck Perry ([email protected]) has taken over Repurposing from a chemistry perspective. As far as I'm aware, there isn't anyone in a pure cheminformatics role anymore, but he may be able to help with the annotation data.
I'm still around as a consultant to CDoT, so if there's anything technical or that Chuck isn't comfortable handling, I can probably help out.

@tnat1031
Contributor

tnat1031 commented Apr 2, 2021

Ok cool, that sounds good to me. Thanks everyone.

@shntnu
Collaborator

shntnu commented Apr 2, 2021

Thanks @tnat1031 and @jrsacher!

@gwaybio gwaybio mentioned this issue May 31, 2021
2 tasks
@gwaybio gwaybio closed this as completed Jun 18, 2021
@gwaybio
Member Author

gwaybio commented Sep 16, 2021

Hi @jrsacher - we are wrapping up this paper now, and we'd like to include you in our acknowledgements section. We will write something to the effect of "We'd like to thank Joshua Sacher for his help in curating Drug Repurposing Hub compound metadata."

Do we have your permission to include you in this section? Thanks again for all of your expertise with this effort!

@jrsacher

Absolutely! I appreciate the appreciation!

@gwaybio
Member Author

gwaybio commented Sep 16, 2021

Will do! Thanks again!


4 participants