Should we reprocess all profiles before frozen data release? #62

gwaybio · 2021-03-22T15:25:01Z

I am leaning towards doing this. To work toward reprocessing, we need to accomplish the following:

- release pycytominer version 0.1. It will be great to include a stable pycytominer version in the conda environment. We've upgraded pycytominer so much since the original reprocessing, and rerunning profiles will ease headaches (see below). (Decided not to pursue)
- update MOA map for batch 2 data (see Adding batch 2 consensus profiles #61 (comment))
  - Resolved in Should we reprocess all profiles before frozen data release? #62 (comment)

What headaches will an updated pycytominer resolve?

- the updated pycytominer fixes the no-name gzip flag (Add --no-name gzip flag to compression file output #50)
- updated naming convention "blacklist" -> "blocklist"
- potential to change epsilon in spherize()
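
For context on the first item: `gzip --no-name` leaves the original file name and timestamp out of the gzip header, so compressing identical content twice gives byte-identical files, which keeps checksums and version-control diffs stable. Below is a minimal sketch of the same idea using only Python's standard library. It is an illustration, not pycytominer's implementation, and `gzip_bytes_deterministic` is a hypothetical helper name.

```python
import gzip
import io


def gzip_bytes_deterministic(payload: bytes) -> bytes:
    """Compress payload with no file name and a zeroed timestamp in the header.

    Hypothetical helper for illustration; mirrors the effect of `gzip --no-name`.
    """
    buffer = io.BytesIO()
    # filename="" omits the FNAME header field; mtime=0 zeroes the timestamp.
    with gzip.GzipFile(filename="", mode="wb", fileobj=buffer, mtime=0) as handle:
        handle.write(payload)
    return buffer.getvalue()


# Toy payload, made up for this example.
first = gzip_bytes_deterministic(b"Metadata_broad_sample,feature_1\n")
second = gzip_bytes_deterministic(b"Metadata_broad_sample,feature_1\n")
```

With the name and timestamp removed, `first` and `second` should be identical, so a rerun of the pipeline only changes a compressed file when its contents change.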

Rerunning the pipeline will also enable us to migrate from git lfs to dvc.

Time estimate

1. Rerunning the pipeline will take non-negligible time, probably ~1 week, but it will increase confidence in, and the organization of, the data.
2. Migrating from git lfs to dvc will take 4 hours
3. Releasing pycytominer version 0.1 will take longer. I think we are close to an official version 0.1 release https://github.com/cytomining/pycytominer/milestone/1


shntnu · 2021-03-22T16:05:10Z

@gwaygenomics I've updated the time estimate section in your top post. I'm not sure how long 2. will take, but if it's not too long, I propose we do 1 and 2, but not 3 (unless you think it's feasible for you to do it, given everything else going on)

gwaybio · 2021-03-30T23:38:27Z

Sounds good. dvc will not take long (a couple of hours). I will use #63 to track 1 (I hope to get this running tomorrow) and will open a new PR for 2.

shntnu · 2021-03-30T23:44:10Z

a new PR for 2

If possible, it will be super helpful if you can add some notes for migrating/setting up dvc to this issue:
cytomining/profiling-template#13
(rough notes are perfectly fine, especially given your time constraints).

gwaybio · 2021-04-01T21:34:12Z

in #61 (comment), I said:

But i wonder if I need to update the external moa file first with the new batch broad ids...

Facing this now. @shntnu, do you have any historical knowledge about how these broad ids might have differed from the pilot?

In n=1, one plate from batch 2 has only 13 MOAs matched in repurposing_info_external_moa_map_resolved.tsv, while batch 1 plates have ~60.
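
A rough sketch of how a per-plate match count like this could be computed with only the Python standard library. The `broad_sample` column comes from the platemaps; the `broad_id` and `moa` column names for the MOA map, and the choice to match on the first 13 characters of the sample ID, are assumptions made for illustration and may not reflect the actual file layout. All function names are hypothetical.

```python
import csv
import io


def read_tsv_rows(text: str) -> list:
    """Parse tab-separated text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def truncate_broad_id(broad_sample: str) -> str:
    """Keep the 13-character 'BRD-X########' prefix (assumed matching key)."""
    return broad_sample[:13]


def count_matched_moas(platemap_rows, moa_rows) -> dict:
    """Count plate compounds that have an MOA annotation, and the distinct MOAs.

    Assumes platemap rows have 'broad_sample' and MOA rows have 'broad_id'
    and 'moa' (hypothetical column names).
    """
    moa_by_id = {row["broad_id"]: row["moa"] for row in moa_rows if row.get("moa")}
    plate_ids = {
        truncate_broad_id(row["broad_sample"])
        for row in platemap_rows
        if row.get("broad_sample")  # skip empty wells
    }
    matched = plate_ids & moa_by_id.keys()
    return {
        "n_compounds": len(plate_ids),
        "n_matched": len(matched),
        "n_moas": len({moa_by_id[broad_id] for broad_id in matched}),
    }
```

Running this once per platemap and comparing `n_moas` across batches would show whether the drop comes from missing identifiers in the map or from the ID format itself.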

shntnu · 2021-04-02T12:03:34Z

do you have any historical knowledge about how these broad ids might have differed from the pilot?

@tnat1031 do you happen to know the answer to this? The details below might help recap.

# Load dplyr, purrr, and readr (all attached by the tidyverse)
library(tidyverse)
# Three batch 2 platemaps from the lincs-cell-painting repository
platemaps <- 
  c("https://raw.githubusercontent.com/broadinstitute/lincs-cell-painting/master/metadata/platemaps/2017_12_05_Batch2/platemap/ASG003_A549_24H.txt",
    "https://raw.githubusercontent.com/broadinstitute/lincs-cell-painting/master/metadata/platemaps/2017_12_05_Batch2/platemap/LKCP001_A549_24H.txt",
    "https://raw.githubusercontent.com/broadinstitute/lincs-cell-painting/master/metadata/platemaps/2017_12_05_Batch2/platemap/LKCP002_A549_24H.txt") 

# Design constants (not used in the rest of this snippet)
n_cell_lines <- 3
n_time_points <- 3

# Read every platemap and keep the unique broad_sample identifiers
lkcp_broad_samples <- 
  platemaps %>%
  map_df(read_tsv, col_types = cols()) %>% 
  distinct(broad_sample)

# Show ten randomly chosen identifiers
lkcp_broad_samples %>% 
  sample_n(10) %>%
  knitr::kable()

broad_sample
BRD-K41599323-001-01-5
BRD-K59325863-001-03-6
BRD-K19034817-001-04-8
BRD-K92723993-001-12-5
BRD-K70301876-034-06-1
BRD-K57252450-001-02-5
BRD-A87130939-001-07-9
BRD-K12906202-001-06-2
BRD-K15567136-003-03-3
BRD-A78195072-001-06-2

lkcp_broad_samples %>%
  count() %>%
  knitr::kable()

n
349

Created on 2021-04-02 by the reprex package (v0.3.0)

tnat1031 · 2021-04-02T12:48:03Z

@shntnu @gwaygenomics If I recall correctly I think the batch 2 compounds (aka LKCP) were not explicitly chosen to overlap with the pilot compounds. Rather, it was an experiment designed to compare the L1000 and CP readouts with exactly the same conditions (compounds, cell lines, doses, replicates, time points exactly matched).

gwaybio · 2021-04-02T12:54:52Z

thanks @tnat1031 - the specific question is if you know why the majority of these compounds do not align with CMAP broad ID annotations. Were they experimental compounds lacking MOA/target info?

shntnu · 2021-04-02T12:56:48Z

Thanks for looking into this @tnat1031

the specific question is if you know why the majority of these compounds do not align with CMAP broad ID annotations.

Exactly

Were they experimental compounds lacking MOA/target info?

@tnat1031 perhaps this doc might help you recollect?

gwaybio · 2021-04-02T13:19:47Z

looks like one of the tables linked in that doc indicates that many of these broad IDs do indeed have MOA annotations.

Two comments:

- I don't see TARGET info
- It's possible that we already have all annotations present in that document, but it does seem like there might be fewer than in batch 1 (I will check)

tnat1031 · 2021-04-02T13:34:52Z

I think one possible issue could be that the 'official' CMap MoA/target annotations from batch 1 were incomplete. These annotations were (and still are) pretty consistently in flux, and it's possible the annotations in the google spreadsheet do not match those in the CMap file you've been using. They should all be annotated though, as none of them are experimental compounds. Are the annotations very different or are they different terms (or spellings) that have similar meaning?

One solution I can think of is to just use whatever MoA/target annotations are currently provided in the repurposing hub as a reputable 3rd party source for this information, then freeze it with the data. I realize this might impact Adeniyi's MoA classification results. Is re-training and re-testing those classifiers prohibitive?
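
One lightweight way to make "freeze it with the data" checkable is to record a checksum of the annotation file alongside the frozen profiles, so any later change to the annotations is detectable. This is only a sketch under that assumption, not something the repository is known to do; `sha256_of_file` is a hypothetical helper name.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Storing the resulting digest in the release notes (or relying on a data versioning tool that tracks file hashes) would tie a given data freeze to one exact version of the annotations.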

gwaybio · 2021-04-02T13:58:59Z

These annotations were (and still are) pretty consistently in flux, and it's possible the annotations in the google spreadsheet do not match those in the CMap file you've been using.

I see. This aligns with our experience. We're actually using a maximally aligned MOA file built from all previous, publicly available CMAP annotation resources. In my opinion, all of these fixes should happen upstream of this repo, so I agree with this plan:

One solution I can think of is to just use whatever MoA/target annotations are currently provided in the repurposing hub as a reputable 3rd party source for this information, then freeze it with the data.

This will not actually impact @AdeboyeML's MOA classification work, since we're already using the maximally aligned annotations. We will, however, need to rerun anyway after data freeze and with spherized (aka whitened) data.

In an attempt to solve these problems upstream, I'll tag @jrsacher. Josh has helped us a ton in getting the best possible alignment of CMAP MOA/Target annotations. Josh, I see that you're no longer at the Broad. If you don't mind, can you connect us with the cheminformatics data scientist who would be most able to help us resolve these issues?

Thanks!

jrsacher · 2021-04-02T14:21:35Z

Chuck Perry ([email protected]) has taken over Repurposing from a chemistry perspective. As far as I'm aware, there isn't anyone in a pure cheminformatics role anymore, but he may be able to help with the annotation data.
I'm still around as a consultant to CDoT, so if there's anything technical or that Chuck isn't comfortable handling, I can probably help out.

tnat1031 · 2021-04-02T14:36:18Z

Ok cool, that sounds good to me. Thanks everyone.

shntnu · 2021-04-02T14:48:26Z

Thanks @tnat1031 and @jrsacher!

gwaybio · 2021-09-16T15:57:19Z

Hi @jrsacher - we are wrapping up this paper now, and we'd like to include you in our acknowledgements section. We will write something to the effect of "We'd like to thank Joshua Sacher for his help in curating Drug Repurposing Hub compound metadata."

Do we have your permission to include you in this section? Thanks again for all of your expertise with this effort!

jrsacher · 2021-09-16T17:06:29Z

Absolutely! I appreciate the appreciation!

gwaybio · 2021-09-16T17:09:55Z

Will do! Thanks again!

gwaybio mentioned this issue Mar 22, 2021

Adding batch 2 consensus profiles #61

Merged

gwaybio mentioned this issue Mar 30, 2021

Frozen data version 1 #63

Merged

5 tasks

gwaybio mentioned this issue Apr 2, 2021

Second batch of lincs data #57

Closed

gwaybio mentioned this issue May 31, 2021

Adding profiles to dvc #66

Closed

2 tasks

gwaybio closed this as completed Jun 18, 2021
