Establish protocols for DataHub submission #72

Open
Bankso opened this issue Jul 3, 2024 · 1 comment

Bankso commented Jul 3, 2024

As part of MC2 Center data routing, the CRDC DataHub is an expected terminal repository that 1) accepts a broad range of data types, 2) accepts human-derived datasets, and 3) has independent metadata requirements for assays, specimens, individuals, etc.

MC2 --> DataHub package prep and transfer protocols need to be established, using the available tools, guidelines, and schemas from DataHub.

Initial thoughts:

  • MC2 metadata templates will incorporate DataHub attributes where possible. Mappings will be designed as needed, but integration is the higher priority.
  • Building Synapse Datasets + Collections that comprise a DataHub submission package would integrate well with our release strategy (related to Design and test process for creating and sharing files + metadata with Synapse Datasets #71).
  • Automated transfer through an API would be desirable.
  • Establishing access restrictions will require some further thinking. Synapse has its own ACR implementation, but DataHub/CRDC uses dbGaP to manage ACRs and requests for sequencing and imaging data.

Bankso commented Dec 12, 2024

Currently reviewing CRDC Submission Portal protocols and API documentation. The DataHub model explorer shows that CDS model V5.0.2 is available, and I am finalizing our current mapping to this version. It is organized differently from the previous versions we were provided, and some of the valid values appear to differ. Mapping is in progress here: https://docs.google.com/spreadsheets/d/1BBEysO142rRGDhi389t5_Zp_V1RwjsdE/edit?gid=852542275#gid=852542275

By mapping, I mean that I am creating a reference that tells us which MC2 model attributes can be used to populate CDS model attributes. Valid values should be identical in all cases, unless otherwise noted.
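
Where valid values might not line up, a quick comparison per mapped attribute pair could flag the cases that need a note in the reference. This is only a minimal sketch: the function name `compare_valid_values` and the example values are made up for illustration and are not taken from the MC2 or CDS models.

```python
def compare_valid_values(mc2_values, cds_values):
    """Return the values that appear in one model's list but not the other."""
    mc2, cds = set(mc2_values), set(cds_values)
    return {"mc2_only": sorted(mc2 - cds), "cds_only": sorted(cds - mc2)}


# Hypothetical valid values for one mapped attribute pair.
diff = compare_valid_values(
    ["Female", "Male", "Unknown"],
    ["Female", "Male", "Not Reported"],
)
print(diff)
```

An empty result on both sides would mean the pair needs no value translation; anything else is a candidate for an "otherwise noted" entry.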

Mapping between CDS V5.0.2 attributes and the MC2 model is done for the Biospecimen, Individual, Study, Imaging, and Sequencing templates.

The next mappings to complete are for Model, GeoMx, and Visium.

Based on these mappings, I think we should create a script that maps MC2 metadata onto CDS/Data Hub templates using a "transformation dictionary".
The script will be generalized and will take a metadata sheet package and data type as input. Dictionaries can be defined as needed and provided at run time.
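
A rough sketch of what the transformation step might look like, assuming the dictionary maps each CDS attribute to the MC2 attribute that populates it, with optional value overrides for the cases where valid values differ. All names here (`transform_rows`, `BIOSPECIMEN_MAP`, the column headers and values) are hypothetical placeholders, not actual MC2 or CDS attributes; real entries would come from the mapping sheet.

```python
import csv
import io


def transform_rows(rows, mapping, value_overrides=None):
    """Rename MC2 columns to CDS columns and remap values where noted.

    mapping: {cds_attribute: mc2_attribute}
    value_overrides: {cds_attribute: {mc2_value: cds_value}}
    """
    value_overrides = value_overrides or {}
    out = []
    for row in rows:
        new_row = {}
        for cds_attr, mc2_attr in mapping.items():
            value = row.get(mc2_attr, "")
            # Values pass through unchanged unless an override is listed.
            new_row[cds_attr] = value_overrides.get(cds_attr, {}).get(value, value)
        out.append(new_row)
    return out


# Hypothetical transformation dictionary for one data type.
BIOSPECIMEN_MAP = {"sample_id": "Biospecimen Key", "sample_type": "Biospecimen Type"}
BIOSPECIMEN_VALUES = {"sample_type": {"Tumor": "Tumor tissue"}}

# Stand-in for one MC2 metadata sheet.
mc2_sheet = io.StringIO("Biospecimen Key,Biospecimen Type\nBS-1,Tumor\nBS-2,Blood\n")
cds_rows = transform_rows(csv.DictReader(mc2_sheet), BIOSPECIMEN_MAP, BIOSPECIMEN_VALUES)
print(cds_rows)
```

Keeping the dictionaries as plain data separate from the function is what would let a new data type be supported at run time without touching the script.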

This will be part of a larger CRDC transfer pipeline, built around the Data Hub API, to automatically create and submit a transfer package. The package will include all files and transformed metadata required for CRDC submission.
