diff --git a/.DS_Store b/.DS_Store index e8551c28..a7022222 100644 Binary files a/.DS_Store and b/.DS_Store differ diff --git a/access_controlled/introduction.md b/access_controlled/introduction.md index 41144ff1..a1827f3b 100644 --- a/access_controlled/introduction.md +++ b/access_controlled/introduction.md @@ -4,4 +4,4 @@ order: 1000 # Access-Controlled Data -Access-controlled HTAN data requires dbGaP access approval for study [phs002371](https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs002371.v3.p1), and is currently only available via the National Cancer Institute's Cancer Data Services (CDS). +Access-controlled HTAN data requires dbGaP access approval for study [phs002371](https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs002371.v3.p1), and is currently only available via the [National Cancer Institute's Cancer Data Services (CDS)](https://dataservice.datacommons.cancer.gov/#/home). diff --git a/addtnl_info/RFC.md b/addtnl_info/RFC.md new file mode 100644 index 00000000..f5fac306 --- /dev/null +++ b/addtnl_info/RFC.md @@ -0,0 +1,38 @@ +--- +order: 996 +--- + +# The RFC Process and Data Model Changes + +## RFC Overview +The HTAN Data Model is expected to evolve with advances in science. This evolution is a community-driven, peer-reviewed process, where members of a working group will first assess established community data standards and create a request for comment (RFC) document soliciting community feedback. + +The status of current RFCs is provided in the [RFC Overview](https://docs.google.com/document/d/1dJ7NUoVCtewdtny8bITwtWnzItB4IibL5kJO3ZNh0go/edit?usp=sharing) document. The RFC Overview can be used to: + +- Get a sense of what is available in DCA. +- Get a sense of new assays being considered. +- Look at old RFCs & get a sense of past discussions/considerations. + +!!! Note: +The links to specific RFC documents within the RFC Overview do **not** represent the final data model. 
Once an RFC is closed and an assay is available on the Data Curator App (DCA), the metadata template on the DCA represents the final data model. Details regarding the data model are also available on HTAN's [Data Standards page](https://humantumoratlas.org/standards) and HTAN's [data-models repository](https://github.com/ncihtan/data-models) on GitHub. +!!! + +## Data Model Changes +The following are requests which require changes to the Data Model and may result in the initiation of an RFC: + +- New assay types which are expected to be used frequently by multiple centers. +- New metadata templates or additional required metadata fields which should be validated. + +HTAN members should contact their [data liaison](../data_submission/Data_Liaisons.md) for help determining whether a Data Model change is needed and how to make a Data Model change request. + +## RFC Process +Once a new assay type or a set of needed Data Model changes is identified, the following steps are taken: + +1. **A working group is organized** by the Data Coordinating Center (DCC). As a part of this process, the following people are also designated: + * A **DCC Owner**, who is responsible for finalizing the RFC and for accepting, rejecting, or integrating community feedback. The DCC Owner is also the primary point of contact for the specified RFC. + * A **single DCC PI**, to monitor progress towards completion. + * One or more **Co-Authors** from one or more HTAN centers, to help draft the RFC. Representatives from each HTAN center help identify individuals at their center who can contribute to a particular RFC. +2. **A first draft of an RFC Google Document is created** based upon feedback from the working group. +3. **The RFC is open for public comment**. All HTAN members can provide suggestions by adding comments directly to the document. +4. After a designated period of time, the **RFC is closed**. Feedback from the HTAN community is no longer accepted. 
The content of the RFC will be reflected in the respective version of the HTAN Data Model used for validating metadata files uploaded to the DCC. +5. **The metadata template is available on the [Data Curator App (DCA)](https://dca.app.sagebionetworks.org/).** diff --git a/addtnl_info/WG_internal.md b/addtnl_info/WG_internal.md new file mode 100644 index 00000000..1f688a7f --- /dev/null +++ b/addtnl_info/WG_internal.md @@ -0,0 +1,12 @@ +--- +order: 997 +--- + +# Working Groups and Internal Communications + +Information regarding Network Working Groups and Internal Communications can be found on [HTAN's Synapse Wiki page](https://www.synapse.org/#!Synapse:syn17022193/wiki/584990). Access to the HTAN Wiki is restricted to HTAN Members. + +!!! Note + +The HTAN Synapse Wiki page is restricted to HTAN members. Please contact htandcc@ds.dfci.harvard.edu if you are a member of HTAN and need access to the wiki. +!!! diff --git a/addtnl_info/data_release.md b/addtnl_info/data_release.md new file mode 100644 index 00000000..5571d8e2 --- /dev/null +++ b/addtnl_info/data_release.md @@ -0,0 +1,12 @@ +--- +order: 998 +--- + +# Data Release + +The Data Coordinating Center (DCC) prepares major data releases every 4-6 months. HTAN Centers are notified of the data submission deadline for an upcoming data release. After that deadline, the pre-release process involves a number of data processing and metadata verification steps. 
Data is released via the HTAN Data Portal, and then disseminated to various Cancer Research Data Commons (CRDC) nodes including Cancer Data Service (CDS) and the Institute for Systems Biology Cancer Gateway in the Cloud (ISB-CGC) to enable download of controlled-access data and long-term cloud access. + +![The HTAN Data Release Process](../img/Data_release.svg) + +Please see [HTAN Data Release Process](https://docs.google.com/document/d/15xvIbfyQmgbMD_uB2e0SwPFw67_AePB5YspF4dsilCA/edit#heading=h.tddsmkcn4p1p) for more information regarding the data release process. + diff --git a/addtnl_info/index.yml b/addtnl_info/index.yml new file mode 100644 index 00000000..ff50be84 --- /dev/null +++ b/addtnl_info/index.yml @@ -0,0 +1,2 @@ +label: Additional Information +order: 995 diff --git a/addtnl_info/publications.md b/addtnl_info/publications.md new file mode 100644 index 00000000..f3b038ae --- /dev/null +++ b/addtnl_info/publications.md @@ -0,0 +1,12 @@ +--- +order: 999 +--- + +# Submitting Publications + +To facilitate data sharing and adherence to FAIR (Findability, Accessibility, Interoperability, and Reusability) principles, the HTAN portal provides links to specimen files used in publications. Currently, the HTAN Data Coordinating Center (DCC) facilitates this linking once HTAN Centers provide the appropriate information. To submit publication information, HTAN Centers should contact Alex Lash at alexl@ds.dfci.harvard.edu. + +!!! *In order to support data sharing and public data access, the DCC encourages authors using HTAN data to either:* +* *use HTAN identifiers in their publication; or* +* *provide a lookup table in the publication to map publication identifiers to HTAN identifiers.* +!!! 
diff --git a/addtnl_info/tnps.md b/addtnl_info/tnps.md new file mode 100644 index 00000000..00d08e81 --- /dev/null +++ b/addtnl_info/tnps.md @@ -0,0 +1,21 @@ +--- +order: 995 +--- + +# Trans-Network Projects (TNPs) +Trans-Network Projects are multi-center projects created to facilitate collaborative research. Examples include cross-testing experimental and analytical protocols, exchange of personnel to disseminate SOPs or pursuit of additional HTAN critical methods or technologies. Specific information about each TNP is available on [HTAN's Synapse Wiki page](https://www.synapse.org/#!Synapse:syn17022193/wiki/584990) for HTAN members. + +!!! Note + +The HTAN Synapse Wiki page is restricted to HTAN members. Please contact htandcc@ds.dfci.harvard.edu if you are a member of HTAN and need access to the wiki. +!!! + + +Current Trans-Network Projects + +| Code | Name | Description | +|------|------|-------------| +| HTA13 | TNP SARDANA | The **S**h**a**red **R**epositories, **D**ata, **An**alysis and **A**ccess TNP focuses on optimizing the repeatability, interpretability and accessibility of HTAN characterization methods and the data they generate. | +| HTA14 | TNP TMA | The **T**issue **M**icro**A**rray TNP extends the TNP SARDANA characterization and analytics methodologies for evaluation and validation to a large array of breast tumor TMA samples that provide a broad spectrum of disease states and subtypes. | +| HTA15 | TNP SRRS | The **S**tandardized **R**epository of **R**eference **S**pecimens TNP's mission is to assemble an extensive catalogue of cases from premalignant lesions, pre- and post-treatment tumor tissue and metastatic tumor tissue for protocol optimization and validation. | +| HTA16 | TNP CASI | The goal of the **C**ell **A**nnotations and **S**ignatures **I**nitiative TNP is to provide robust and accurate tools for cell type annotation from single-cell data. 
| \ No newline at end of file diff --git a/addtnl_info/tool_protocol.md b/addtnl_info/tool_protocol.md new file mode 100644 index 00000000..3d605cf1 --- /dev/null +++ b/addtnl_info/tool_protocol.md @@ -0,0 +1,18 @@ +--- +order: 1000 +--- + +# Tool and Protocol Curation + +Computational tools developed or used to support HTAN research projects can be added to the HTAN tool catalog by filling out the tool curation form available on [HTAN's Synapse Wiki page](https://www.synapse.org/#!Synapse:syn17022193/wiki/584990). + + +Information regarding how protocols are developed/shared is also available on [HTAN's Synapse Wiki page](https://www.synapse.org/#!Synapse:syn17022193/wiki/584990). + + +!!! Note + +The HTAN Synapse Wiki page is restricted to HTAN members. Please contact htandcc@ds.dfci.harvard.edu if you are a member of HTAN and need access to the wiki. +!!! + + diff --git a/CNAME b/archive_CNAME similarity index 100% rename from CNAME rename to archive_CNAME diff --git a/data_model/biospecimens.md b/data_model/biospecimens.md deleted file mode 100644 index cc8b7634..00000000 --- a/data_model/biospecimens.md +++ /dev/null @@ -1,17 +0,0 @@ ---- -order: 997 ---- - -# Biospecimen Metadata - -The HTAN biospecimen data model is designed to capture essential biospecimen data elements, including: - -- Acquisition method, e.g. autopsy, biopsy, fine needle aspirate, etc. -- Topography Code, indicating site within the body, e.g. based on [ICD-O-3](https://seer.cancer.gov/icd-o-3/). -- Collection information e.g. time, duration of ischemia, temperature, etc. -- Processing of parent biospecimen information e.g. fresh, frozen, etc. -- Biospecimen and derivative clinical metadata ie Histologic Morphology Code, e.g. based on [ICD-O-3](https://seer.cancer.gov/icd-o-3/). -- Coordinates for derivative biospecimen from their parent biospecimen. -- Processing of derivative biospecimen for downstream analysis e.g. dissociation, sectioning, analyte isolation, etc. 
- -Complete details are available online at: https://data.humantumoratlas.org/standard/biospecimen diff --git a/data_model/clinical.md b/data_model/clinical.md deleted file mode 100644 index 734a2675..00000000 --- a/data_model/clinical.md +++ /dev/null @@ -1,29 +0,0 @@ ---- -order: 998 ---- - -# Clinical Metadata - -The HTAN clinical data model consists of three tiers. Tier 1 is in alignment with the [Genomic Data Commons (GDC)](https://gdc.cancer.gov/) guidelines for clinical data, while Tiers 2 and 3 are HTAN extensions to the GDC model. - -| Tier | Notes | -| ---- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| 1 | Tier 1 is based entirely on the clinical data model used by the NCI Genomic Data Commons (GDC) [6]. It consists of seven categories of clinical data (see GDC Table below). | -| 2 | Disease-agnostic extensions to the GDC Clinical Data Model. | -| 3 | Disease-specific extensions to the GDC Clinical Data Model. This covers additional elements for Acute Lymphoblastic Leukemia (ALL), Brain Cancer, Breast Cancer, Lung Cancer, Melanoma, Ovarian Cancer, Pancreatic Cancer, Prostate Cancer and Sarcoma. | - -## GDC Clinical Data Model - -The GDC Clinical Data Model consists of seven categories of clinical data. 
- -| GDC Category | GDC Description | -| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| Demographics | Data for the characterization of the patient by means of segmenting the population (e.g., characterization by age, sex, or race). | -| Diagnosis | Data from the investigation, analysis and recognition of the presence and nature of disease, condition, or injury from expressed signs and symptoms; also, the scientific determination of any kind; the concise results of such an investigation. | -| Exposure | Clinically relevant patient information not immediately resulting from genetic predispositions. | -| Family History | Record of a patient's background regarding cancer events of blood relatives. | -| Follow-up | A visit by a patient or study participant to a medical professional. A clinical encounter that encompasses planned and unplanned trial interventions, procedures and assessments that may be performed on a subject. A visit has a start and an end, each described with a rule. The process by which information about the health status of an individual is obtained before and after a study has officially closed; an activity that continues something that has already begun or that repeats something that has already been done. | -| Molecular Test | Information pertaining to any molecular tests performed on the patient during a clinical event. 
| -| Therapy | Record of the administration and intention of therapeutic agents provided to a patient to alter the course of a pathologic process. | - -Complete details regarding all clinical data elements is available at: https://data.humantumoratlas.org/standard/clinical diff --git a/data_model/data_standards.md b/data_model/data_standards.md new file mode 100644 index 00000000..1c1ef570 --- /dev/null +++ b/data_model/data_standards.md @@ -0,0 +1,11 @@ +--- +order: 997 +--- + +# Data Standards + +This page is a placeholder for a data standards page/set of data standards pages similar to [MC2 Center Data Model](https://mc2-center.github.io/data-models/). The HTAN version of the MC2 tables will include additional columns such as "required_if_component" and "required_if_value". Until the new pages are constructed, please see the information on the [Data Standards](https://humantumoratlas.org/standards) page of the HTAN Data Portal. + +!!! Note +Once these pages are added, the [Data Standards](https://humantumoratlas.org/standards) page will be removed from the data portal. All links to "Data Standards" throughout this manual will need to be updated. +!!! \ No newline at end of file diff --git a/data_model/identifiers.md b/data_model/identifiers.md index cc8dcef2..e0e5b2ef 100644 --- a/data_model/identifiers.md +++ b/data_model/identifiers.md @@ -15,24 +15,8 @@ Research participants are identified with the following pattern: ::= _integer ``` -Where the `htan_center_id` is derived from the identifier prefix table below. 
- -| HTAN Center ID | Pilot Project or Contact PI Institution | -| -------------- | --------------------------------------- | -| HTA1 | HTAPP Pilot Project | -| HTA2 | PCAPP Pilot Project | -| HTA3 | Boston University | -| HTA4 | Children's Hospital of Philadelphia | -| HTA5 | Dana-Farber Cancer Institute | -| HTA6 | Duke University | -| HTA7 | Harvard Medical School | -| HTA8 | Memorial Sloan Kettering Cancer Center | -| HTA9 | Oregon Health Sciences University | -| HTA10 | Stanford University | -| HTA11 | Vanderbilt University | -| HTA12 | Washington University | -| HTA13 | TNP SARDANA | -| HTA14 | TNP TMA | +Where the `htan_center_id` is the HTAN Center prefix (e.g. HTA1, HTA2). Please see [HTAN Centers](../overview/centers.md) for a full list of HTAN Center prefixes. + Derivative data includes anything derived from a research participant, including biospecimens such as samples, tissue blocks, slides, aliquots, analytes, and data files that result from assaying those biospecimens. These identifiers follow the pattern: @@ -40,13 +24,14 @@ Derivative data includes anything derived from a research participant, including ::= _integer ``` -For example, if research participant 1 within the CHOP project has provided three samples, you would have three HTAN IDs, such as: +For example, if research participant 1 within the CHOP project (HTA4) has provided three samples, you would have three HTAN IDs, such as: ``` HTA4_1_1 HTA4_1_3 HTA4_1_8 ``` +## Special Identifiers If a single data file is generated from one of those samples, that file could have an HTAN ID such as: @@ -54,6 +39,24 @@ If a single data file is generated from one of those samples, that file could ha HTA4_1_42 ``` +If a single data file is derived from more than one participant, the file identifier may contain a wildcard string, e.g. ‘0000’, after the HTAN center identifier. 
For example: + +``` +HTA4_0000_1 +HTA4_0000_2 +HTA4_0000_3 +``` + +If a data file is derived from an external control participant, the biospecimen and file identifiers will contain the string ‘EXT’ before the external control participant integer. For example: + +``` +HTA4_EXT1_1 +HTA4_EXT2_2 +HTA4_EXT3_3 +``` + More detailed information about HTAN Identifiers may be found in the [HTAN Identifiers SOP](https://docs.google.com/document/d/1podtPP8L1UNvVxx9_c_szlDcU1f8n7bige6XA_GoRVM/edit#heading=h.768a6pngjha3). +## ID to ID linkages + Note that the explicit linking of participants to biospecimens to assays is not encoded in the HTAN Identifier. Rather, the linking is encoded in explicit metadata elements (see [Relationship Model](relationships.md)). diff --git a/data_model/imaging.md b/data_model/imaging.md deleted file mode 100644 index 3ab7bfec..00000000 --- a/data_model/imaging.md +++ /dev/null @@ -1,22 +0,0 @@ ---- -order: 995 ---- - -# Imaging Data - -The HTAN data model for imaging data is based upon the [Minimum Information about Tissue Imaging (MITI)](https://www.miti-consortium.org/) reporting guidelines. These comprise minimal metadata for highly multiplexed tissue images and were developed in consultation with methods developers, experts in imaging metadata (e.g., DICOM and OME) and multiple large-scale atlas projects; they are guided by existing standards and accommodate most multiplexed imaging technologies and both centralized and distributed data storage. - -For further information on the MITI guidelines, please see the [MITI website](https://www.miti-consortium.org/), [specification on Github](https://github.com/miti-consortium/MITI), and [Nature Methods publication](https://www.nature.com/articles/s41592-022-01415-4). - -The HTAN data model for imaging was intended primarily for multiplexed imaging such as CODEX, CyCIF, and IMC, in addition to brightfield imaging of H&E stained tissues. 
- -As with Sequencing data, the imaging data model is split into data levels as follows: - -| Level | Description | -| ----- | -------------------------------------------------------------------------------------------------------------------------------------------------- | -| 1 | Raw imaging data requiring tiling, stitching, illumination correction, registration or other pre-processing. | -| 2 | Imaging data compiled into a single file format, preferably a tiled and pyramidal OME-TIFF. Accompanied by a csv file containing channel metadata. | -| 3 | Segmentation mask, Validated channel metadata, QC checked image. | -| 4 | An object-by-feature table (typically cell-by-marker) generated from the segmentation mask and image. | - -Before preparing imaging data for upload to DCC, please consult [HTAN Imaging Data Requirements](https://docs.google.com/document/d/1iNicigsSytekEQLkmeNJd2NOJ9VTKzBDfYj3BmvGcro/edit#heading=h.b6j67xcu50c2). diff --git a/data_model/overview.md b/data_model/overview.md index 45e654b0..82781736 100644 --- a/data_model/overview.md +++ b/data_model/overview.md @@ -9,5 +9,5 @@ All HTAN Centers are required to encode their data and metadata in the **common As much as possible, the HTAN Data Model leverages previously defined data standards across the scientific research community, including the [NCI Genomic Data Commons](https://gdc.cancer.gov/), the [Human Cell Atlas](https://www.humancellatlas.org/), the [Human Biomolecular Atlas Program (HuBMAP)](https://hubmapconsortium.org/) and the [Minimum Information about Tissue Imaging (MITI)](https://www.miti-consortium.org/) reporting guidelines. !!! Data Standards -Complete information regarding the HTAN Data Model is available at: https://data.humantumoratlas.org/standards. +Complete information regarding the HTAN Data Model and specific data elements is available at: https://data.humantumoratlas.org/standards. !!! 
diff --git a/data_model/relationships.md b/data_model/relationships.md index 14a4d476..1284e0e9 100644 --- a/data_model/relationships.md +++ b/data_model/relationships.md @@ -1,5 +1,5 @@ --- -order: 993 +order: 998 --- # Relationship Model diff --git a/data_model/sequencing.md b/data_model/sequencing.md deleted file mode 100644 index 64e6730f..00000000 --- a/data_model/sequencing.md +++ /dev/null @@ -1,20 +0,0 @@ ---- -order: 996 ---- - -# Sequencing Data - -HTAN supports multiple sequencing modalities including Single Cell and Single Nucleus RNA Seq (sc/snRNASeq), Single Cell ATAC Seq, Bulk RNA Seq and Bulk DNA Seq. - -The HTAN standard for gene annotations is [GENCODE Version 34](https://www.gencodegenes.org/human/release_34.html). [GENCODE](https://www.gencodegenes.org/) is used for gene definitions by many consortia, including ENCODE, NCI Genomic Data Commons, Human Cell Atlas, and PCAWG (Pan-Cancer Analysis of Whole Genomes). Ensembl gene content is essentially identical to that of GENCODE ([FAQ](https://www.gencodegenes.org/pages/faq.html)) and interconversion is possible. - -HTAN has adopted the [GENCODE 34](https://www.gencodegenes.org/human/release_34.html) Gene Transfer Format ([GTF](https://useast.ensembl.org/info/website/upload/gff.html)) comprehensive gene annotation file (GENCODE 34 GTF) and filtered files (GENCODE 34 GTF with genes only; GENCODE 34 GTF with genes only and retaining only chromosome X copy of pseudoautosomal region) for HTAN gene annotation. Note that HTAN also includes data generated with other gene models, as the process of implementing the standard is ongoing. Within HTAN metadata files, the reference genome used can be found in the attribute “Genomic Reference” and “Genomic Reference URL”. 
- -In alignment with The Cancer Genome Atlas and the NCI Genomic Data Commons, sequencing data are divided into four levels: - -| Level | Definition | Example Data | -| ----- | -------------------------- | ---------------------------------------- | -| 1 | Raw data | FASTQs, unaligned BAMs | -| 2 | Aligned primary data | Aligned BAMs | -| 3 | Derived biomolecular data | Gene expression matrix files, VCFs, etc. | -| 4 | Sample level summary data. | t-SNE plot coordinates, etc. | diff --git a/data_model/spatial_transcriptomics.md b/data_model/spatial_transcriptomics.md deleted file mode 100644 index ebac2531..00000000 --- a/data_model/spatial_transcriptomics.md +++ /dev/null @@ -1,20 +0,0 @@ ---- -order: 994 ---- - -# Spatial Transcriptomics - -The HTAN data model for spatial transcriptomics data is based upon both imaging and single cell sequencing data models. These form a collection of metadata fields where transcriptomic levels (or gene or protein level measures) can be mapped to locations on a tissue slide, and were developed in consultation with the data generating centers who are both experts in imaging metadata (e.g. DICOM and OME) and multiple large-scale atlas projects. - -The HTAN data model currently supports 10X Visium data, but additional platforms will be added in the near future including Nanostring GeoMX and Pick-Seq. - -Spatial transcriptomic datasets are typically comprised of RNA-sequencing data at varying levels, coupled with imaging data and an auxiliary set of files used in or generated by processing workflows for spatial transcriptomics: - -| Level | Description | -| --------------- | ------------------------------------------------------------------------------------------------------------------------------ | -| Spatial Transcriptomics RNA-seq Level 1 | Files contain raw RNA-seq data associated with spot/slide data. | -| Spatial Transcriptomics RNA-seq Level 2 | Alignment workflows downstream of Spatial Transcriptomics RNA-seq Level 1. 
| -| Spatial Transcriptomics RNA-seq Level 3 | Processed data files based on Spatial Transcriptomics RNA-seq Level 2 and Spatial Transcriptomics Auxiliary files. | -| Spatial Transcriptomics RNA-seq Level 4 | Processed data files based on Spatial Transcriptomics RNA-seq Level 3. | -| Spatial Transcriptomics Auxiliary Files | Auxiliary data associated with spot/slide analysis (aligned Images, quality control files, etc) from Spatial Transcriptomics. | -| Imaging Level 2 | Imaging data compiled into a single file format, preferably a tiled and pyramidal OME-TIFF. | diff --git a/data_submission/Data_Deidentification.md b/data_submission/Data_Deidentification.md new file mode 100644 index 00000000..8d7bfcd9 --- /dev/null +++ b/data_submission/Data_Deidentification.md @@ -0,0 +1,26 @@ +--- +order: 998 +--- + +# Data De-identification + +!!! **HTAN Centers are responsible for data deidentification.** +!!! + +As outlined in the HTAN [DMSA](https://docs.google.com/document/d/1RPFm9MBJv8DjZmYZyIv0jbjtNJ8fnwGjYDjlK4lL4nc/edit#heading=h.gjdgxs), data submitted to the Data Coordinating Center (DCC) must be fully de-identified. + +By signing the HTAN DMSA, HTAN members’ institutional signing officials and PIs accept responsibility for the de-identification of data prior to transfer to the DCC and confirm that: + +- all data disclosed to the DCC (Synapse) are fully de-identified in accordance with HIPAA; + +- all data were collected in accordance with protocols approved by an IRB or its equivalent; + +- all data are consistent with applicable U.S. laws, regulations, and its institutional policies; and + +- an IRB/Privacy Board or equivalent body has assured that submission and subsequent sharing of data are consistent with the Informed Consent of the Data Subject(s) from whom the data were obtained. In addition, the data are protected by an NIH Certificate of Confidentiality. 
+ +New HTAN Centers should develop and submit a De-identification Plan using the [HTAN Atlas De-Identification Plan Template](https://docs.google.com/document/d/1jFXYVMhLEGVcMNBh3L-U1rr4B-KLG7gK-ETc1SzMHhs/edit?usp=sharing). + +Prior to transferring data to the HTAN DCC, members are responsible for fully de-identifying the data being transferred. Full de-identification of data includes confirmation that data file names do not contain any information that could be used to re-identify that data subject. + + diff --git a/data_submission/Data_Liaisons.md b/data_submission/Data_Liaisons.md new file mode 100644 index 00000000..ee495707 --- /dev/null +++ b/data_submission/Data_Liaisons.md @@ -0,0 +1,29 @@ +--- +order: 997 +--- + +# Data Liaisons + +Upon joining HTAN, Centers are assigned a data liaison from the DCC. Trans-Network Projects (TNPs) are also assigned liaisons. The DCC liaisons assist each of the research centers in successfully uploading data and metadata files. + +Here is the current list of centers, their atlases and DCC liaisons: + +| Atlas | Atlas ID | Liaison | Email | +|-------|----------|---------|-------| +| PILOT - HTAPP | HTA1 | Vesteinn Thorsson | thorsson@isbscience.org | +| PILOT - PCAPP | HTA2 | Jennifer Altreuter | jennifer@ds.dfci.harvard.edu | +| HTAN BU | HTA3 | Jennifer Altreuter | jennifer@ds.dfci.harvard.edu | +| HTAN CHOP | HTA4 | Ino de Bruijn | debruiji@mskcc.org | +| HTAN DFCI | HTA5 | Jennifer Altreuter | jennifer@ds.dfci.harvard.edu | +| HTAN Duke | HTA6 | Ino de Bruijn | debruiji@mskcc.org | +| HTAN HMS | HTA7 | Adam Taylor | adam.taylor@sagebase.org | +| HTAN MSK | HTA8 | Ino de Bruijn | debruiji@mskcc.org | +| HTAN OHSU | HTA9 | Adam Taylor | adam.taylor@sagebase.org | +| HTAN Stanford | HTA10 | Adam Taylor | adam.taylor@sagebase.org | +| HTAN Vanderbilt | HTA11 | Clarisse Lau | clau@systemsbiology.org | +| HTAN WUSTL | HTA12 | Clarisse Lau | clau@systemsbiology.org | +| TNP SARDANA | HTA13 | Dave Gibbs | 
dgibbs@systemsbiology.org | +| TNP TMA | HTA14 | Dave Gibbs | dgibbs@systemsbiology.org | +| TNP SRRS | HTA15 | Clarisse Lau | clau@systemsbiology.org | +| | HTA15 | Jennifer Altreuter (snRNAseq) | jennifer@ds.dfci.harvard.edu | +| TNP CASI | HTA16 | Jennifer Altreuter | jennifer@ds.dfci.harvard.edu | \ No newline at end of file diff --git a/data_submission/Information_New_Centers.md b/data_submission/Information_New_Centers.md new file mode 100644 index 00000000..235bcb93 --- /dev/null +++ b/data_submission/Information_New_Centers.md @@ -0,0 +1,17 @@ +--- +order: 999 +--- + +# Information for New HTAN Centers + +Welcome to the Human Tumor Atlas Network! + +The [Resources page](https://humantumoratlas.org/resources) of the HTAN Data Portal provides documentation and applicable policies detailing the requirements for publications, data sharing, and data use. + +All HTAN members must have an executed Human Tumor Atlas Network DMSA (Internal Data and Materials Sharing Agreement) [(HTAN DMSA)](https://docs.google.com/document/d/1RPFm9MBJv8DjZmYZyIv0jbjtNJ8fnwGjYDjlK4lL4nc/edit) with Sage Bionetworks prior to contributing data to the HTAN Data Coordinating Center [(DCC)](https://humantumoratlas.org/htan-dcc). To initiate execution of the HTAN DMSA, contact Sage HTAN Governance (htan@sagebionetworks.jira.com). Please include the name and contact information of your HTAN PI and Institution Signatory to enable routing the HTAN DMSA for execution. + +HTAN Centers are assigned a [data liaison](../data_submission/Data_Liaisons.md) from the [DCC](https://humantumoratlas.org/htan-dcc). Trans-Network Projects (TNPs) are also assigned liaisons. Your liaison will help guide you through setting up a new atlas or project, creating HTAN identifiers, and submitting metadata and data files. Please keep your liaison informed of publication timelines and new data submissions. 
+ +Please see the appropriate page of this manual for additional details about HTAN center responsibilities for data de-identification, including submitting a data de-identification plan, and specific instructions regarding how to submit data. Clinical, biospecimen, and assay data submitted to the DCC are distributed to repositories based on access levels. Information regarding how data are accessed by external users is described more in [other parts of this manual](../overview/data_levels.md). + +In order to support the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles for scientific data production, the DCC has developed a [data model](../data_model/) based on established standards in the scientific research community. The HTAN Data Model is expected to evolve with advances in science. This evolution is a community-driven, peer-reviewed process, where members of a working group will first assess established community data standards and create a request for comment (RFC) document soliciting community feedback. The RFC process is described in more detail in later pages of this manual. We look forward to working with you and learning from your expertise as we improve upon our current model. \ No newline at end of file diff --git a/data_submission/clin_biospec_assay.md b/data_submission/clin_biospec_assay.md new file mode 100644 index 00000000..a2b16153 --- /dev/null +++ b/data_submission/clin_biospec_assay.md @@ -0,0 +1,73 @@ +--- +order: 995 +--- + +# Submitting Assay Data and Metadata + +As stated in [Data Submission Overview](../data_submission/overview.md), data submission involves two key steps: +1. Uploading assay data files to Synapse; and +2. Completing and validating metadata using the Data Curator App (DCA). + +This page provides details regarding those steps. 
+ +![HTAN Data Submission Process](../img/Data_submission.svg) + +To submit data, you will also need to understand the HTAN data model and specific requirements for your particular data type. For a general overview of the HTAN data model, please see [HTAN Data Model](../data_model/overview.md). To understand specific requirements for your data type, please see [Data Standards](https://humantumoratlas.org/standards). + +HTAN uses the Synapse [Portal](https://www.synapse.org) and [DCA](https://dca.app.sagebionetworks.org/), developed and maintained by [Sage Bionetworks](https://sagebionetworks.org/), to manage clinical, biospecimen and assay data submissions (dataset ingress). In order to submit data, your center should: + +1. [Have at least one user with Certified User status on Synapse.](#have-at-least-one-user-with-certified-user-status-on-synapse) +2. [Contact your Data Liaison to set up your project and cloud bucket.](#contact-your-data-liaison-to-set-up-your-project-and-cloud-bucket) +3. [Ensure the assay dataset conforms to the HTAN Data Model, uses HTAN Identifiers and does not contain Protected Health Information (PHI).](#ensure-the-dataset-conforms-to-the-htan-data-model-uses-htan-identifiers-and-does-not-contain-phi) +4. [Organize and upload your dataset to the Synapse Project](#organize-and-upload-your-dataset-to-the-synapse-project) +5. [Validate and submit metadata using the DCA.](#validate-and-submit-metadata-using-synapses-data-curator-app-dca) + +Please read the rest of this page for more information about each of these steps. + +## Have at least one user with Certified User status on Synapse. +To upload files to the Synapse Platform, you need to be a [Synapse Certified User](https://help.synapse.org/docs/Synapse-User-Account-Types.2007072795.html). Because Synapse stores data from human subjects research, Sage Bionetworks requires that you demonstrate understanding of and compliance with privacy and security issues. 
You can complete your certification by taking a short certification quiz. Please see the Synapse [Certified User Documentation](https://help.synapse.org/docs/Synapse-User-Account-Types.2007072795.html) for more information. + +## Contact your Data Liaison to set up your project and cloud bucket. + +When you are ready to upload data, please contact your [data liaison](../data_submission/Data_Liaisons.md). Your data liaison will need to know: +1. Your center. +2. Who on your team will be doing the data upload. +3. The Synapse usernames for the team members identified in item 2. + +Please have users obtain Certified User status prior to contacting your data liaison. + +With the above information, the DCC will initialize your Synapse project for metadata submission and a cloud storage location for dataset uploads. If the data submission is for a new atlas, the DCC will also create an HTAN atlas ID. Once your Synapse project has been initialized, your data liaison will reach out to you with the location of your Synapse project and you can begin uploading your data. + +## Ensure the dataset conforms to the HTAN Data Model, uses HTAN Identifiers and does not contain PHI. + +The HTAN Data Model is built upon data standards described on the [Data Standards](https://data.humantumoratlas.org/standards) page. All HTAN Centers are required to encode their clinical, biospecimen and assay data and metadata using the HTAN Data Model. If you have a new data type which is not currently represented in the HTAN Data Model, please contact your data liaison. + +A concrete way to understand the expectations for data submissions is to view the metadata templates (manifests) for clinical, biospecimen and assay data available in the [DCA](https://dca.app.sagebionetworks.org/). For any given dataset, you may be submitting: + +- clinical manifest(s), e.g. Demographics, Diagnosis +- biospecimen manifest(s) +- assay manifest(s), e.g. 
Bulk RNA-seq level 1 +- assay data files + +The first three items will be validated and submitted using the DCA. The last item, assay data files, only needs to be uploaded to the Synapse project itself. + +All data should be identified using HTAN identifiers. Please see the [HTAN Identifier](../data_model/identifiers.md) section of this manual for more information regarding HTAN identifiers. + +!!! *Please review your data to ensure that it does not contain PHI.* +!!! + +## Organize and upload your dataset to the Synapse Project + +Please organize your data using the flattened data layout described in Synapse's [Data Ingress Docs](https://dca-docs.scrollhelp.site/DCA/Working-version/HTAN/organize-your-data-upload#OrganizeyourDataUpload-FlattenedDataLayoutExample). + +Data files can be transferred using the Synapse User Interface (Synapse UI) or programmatically. Please see Synapse's [Data Ingress Docs](https://dca-docs.scrollhelp.site/DCA/Working-version/HTAN/uploading-data) for more information on how to upload files. + +!!! If you upload files to Synapse programmatically, please use synapseclient version 3.0.0 or higher. +!!! + + +## Validate and submit metadata using Synapse's Data Curator App (DCA). + +The DCA contains HTAN-specific metadata templates which can be completed in the app or downloaded. Once these are completed by your center, they should then be validated and submitted via the DCA. + +Please see Synapse's [Data Ingress Docs](https://dca-docs.scrollhelp.site/DCA/Working-version/HTAN/validate-and-submit-your-metadata) for more details regarding the web app. 
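For centers that upload programmatically, the note above about using synapseclient version 3.0.0 or higher can be checked before a transfer starts. The following is a minimal sketch, not part of the official HTAN or Synapse tooling: the helper names (`parse_version`, `meets_minimum`, `installed_synapseclient_ok`) are hypothetical, and the version parsing is deliberately simple (it ignores pre-release suffixes such as `rc1`). It only uses the Python standard library to read the installed package version.

```python
from importlib import metadata

# Minimum version taken from the note in this manual (3.0.0 or higher).
MINIMUM_SYNAPSECLIENT = (3, 0, 0)


def parse_version(version: str) -> tuple:
    """Convert a version string such as '3.1.0' into a tuple of three integers.

    Hypothetical helper: only the leading digits of each of the first three
    dot-separated pieces are used, so '4.0.0rc1' becomes (4, 0, 0).
    """
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        parts.append(int(digits) if digits else 0)
    # Pad short versions such as '3.1' to three components.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(version: str, minimum: tuple = MINIMUM_SYNAPSECLIENT) -> bool:
    """Return True if the given version string is at least the minimum."""
    return parse_version(version) >= minimum


def installed_synapseclient_ok() -> bool:
    """Return True only if synapseclient is installed and recent enough."""
    try:
        installed = metadata.version("synapseclient")
    except metadata.PackageNotFoundError:
        # The package is not installed in this environment.
        return False
    return meets_minimum(installed)
```

A script could call `installed_synapseclient_ok()` at startup and stop with a clear message if it returns `False`, rather than failing partway through an upload.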
diff --git a/data_submission/index.yml b/data_submission/index.yml new file mode 100644 index 00000000..fe11ae5d --- /dev/null +++ b/data_submission/index.yml @@ -0,0 +1,2 @@ +label: Submitting Data +order: 996 diff --git a/data_submission/metadata.md b/data_submission/metadata.md new file mode 100644 index 00000000..cd6cd3c7 --- /dev/null +++ b/data_submission/metadata.md @@ -0,0 +1,17 @@ +--- +order: 996 +--- + +# What is Metadata? + +Metadata means data *about* data. Metadata enables both data searchability and interpretability. For HTAN, this includes sample and case identifiers, patient information (e.g. demographics), biospecimen information (e.g. tumor type), and assay-specific information (e.g. experiment protocol, assay reagents or assay technology). + +![Example HTAN Metadata vs Assay Data](../img/metadata.svg) + +HTAN's [Data Model](../data_model/overview.md) is a framework for collecting and storing metadata. The Data Model in turn supports effective searching for data on [HTAN's Data Portal](https://humantumoratlas.org/explore). + +Metadata is submitted to HTAN via the Synapse Data Curator App [(DCA)](https://dca.app.sagebionetworks.org/), developed and maintained by [Sage Bionetworks](https://sagebionetworks.org/). The DCA performs several automated validation checks to make sure the metadata complies with the HTAN Data Model. Please see [Submitting Assay Data and Metadata](../data_submission/clin_biospec_assay.md) for more information about the DCA. + +!!! Terminology Alert +The term "manifests" refers to the spreadsheets used to submit metadata. "Metadata templates" are available via the DCA. These are manifests which can be filled out, validated and submitted using the DCA's web interface. +!!! 
\ No newline at end of file diff --git a/data_submission/overview.md b/data_submission/overview.md new file mode 100644 index 00000000..86a53ffc --- /dev/null +++ b/data_submission/overview.md @@ -0,0 +1,18 @@ +--- +order: 1000 +--- + +# Data Submission Overview +Only HTAN Centers and Associate Members can submit data to the HTAN Network's repositories. The Data Submission Section of this Manual is intended as a guide for HTAN Centers and Associate Members. + +:exclamation: *Prior to submitting data, all data must be de-identified. Please see [Data De-identification](../data_submission/Data_Deidentification.md) for more information.* + +Data Submission involves two key steps: +1. Uploading assay data files to Synapse; and +2. Completing and validating manifests using the Data Curator App (DCA). + +![Data Submission Overview](../img/Data_Submit_Overview.svg) + +Specific details regarding data submission and the DCA are included in later sections of this manual. Please contact your [Data Liaison](../data_submission/Data_Liaisons.md) if you have any questions or issues. Please also keep your data liaison informed of any data submissions. + +The current status of data uploads (refreshed every 4 hours) is available on the [HTAN Dashboard](http://hdash.website-us-east-1.linodeobjects.com/index.html). \ No newline at end of file diff --git a/data_submission/specific_details.md b/data_submission/specific_details.md new file mode 100644 index 00000000..c1433d8a --- /dev/null +++ b/data_submission/specific_details.md @@ -0,0 +1,27 @@ +--- +order: 993 +--- + +# Specific Assay/Data Element Details + +Please see [Data Standards](https://data.humantumoratlas.org/standards) for an overview of HTAN Data Levels and Metadata Attributes for each data type. The following links provide specific submission details for each data type. + +!!! under development. +:construction: Currently this page contains additional information for Imaging data. 
The Data Coordinating Center (DCC) plans to develop additional documents which will be linked from this page at a later time. +!!! + +Accessory Files + +Biospecimen + +Clinical Data + +[Imaging](https://docs.google.com/document/d/1iNicigsSytekEQLkmeNJd2NOJ9VTKzBDfYj3BmvGcro/edit#heading=h.b6j67xcu50c2) + +RPPR + +Sequencing Data + +Spatial Transcriptomics + + diff --git a/img/Data_Pub_Overview.svg b/img/Data_Pub_Overview.svg new file mode 100644 index 00000000..d6bfb046 --- /dev/null +++ b/img/Data_Pub_Overview.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/img/Data_Submit_Overview.svg b/img/Data_Submit_Overview.svg new file mode 100644 index 00000000..39d5aedf --- /dev/null +++ b/img/Data_Submit_Overview.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/img/Data_release.svg b/img/Data_release.svg new file mode 100644 index 00000000..67878ebc --- /dev/null +++ b/img/Data_release.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/img/Data_submission.svg b/img/Data_submission.svg new file mode 100644 index 00000000..6374b85e --- /dev/null +++ b/img/Data_submission.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/img/metadata.svg b/img/metadata.svg new file mode 100644 index 00000000..82eee891 --- /dev/null +++ b/img/metadata.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/open_access/introduction.md b/open_access/introduction.md index 2e4b944b..75e7f9dd 100644 --- a/open_access/introduction.md +++ b/open_access/introduction.md @@ -4,8 +4,14 @@ order: 1000 # Open Access Data -Open access HTAN data is available via: +The [HTAN Data Portal](https://humantumoratlas.org/explore) provides an overview of all released data. For Open Access data, the Portal also provides links to: +- downloadable files on Synapse; +- CellxGene and Xena for visualization of single cell RNA-sequencing data; +- Minerva rendered images and stories; +- Google BigQuery tables; and +- cBioPortal. 
-- The HTAN Data Portal -- NCI Image Data Commons (IDC) -- Google BigQuery +Google BigQuery tables provide direct access to a subset of assay data (mainly level 4 files) on a cloud platform. The Google BigQuery tables provide an easy way to build cohorts using specific data fields. HTAN also provides sample code for working with the data in Google BigQuery. + +!!! For more detailed information about the HTAN Portal, CellxGene, Minerva-rendered Images, accessing Images via SB-CGC and/or Using Google BigQuery, please see the next sections of this manual. +!!! \ No newline at end of file diff --git a/overview/centers.md b/overview/centers.md index 601948eb..c3bc3082 100644 --- a/overview/centers.md +++ b/overview/centers.md @@ -4,7 +4,7 @@ order: 999 # HTAN Centers -The HTAN Network consists of ten research centers, and two pilot projects. We also run multiple trans-network projects, referred to as TNPs. Each research center or TNP Project is identified with a unique HTAN prefix. +HTAN currently consists of ten research centers, and two pilot projects. There are also multiple trans-network projects, referred to as TNPs. Each research center or TNP Project is identified with a unique HTAN prefix. | Prefix | Contact Institution or Project Name | Atlas Type | Area of Focus | | ------ | --------------------------------------- | ---------------- | --------------------------------- | @@ -22,5 +22,7 @@ The HTAN Network consists of ten research centers, and two pilot projects. We al | HTA12 | Washington University in St. Louis | Tumor Atlas | Multiple Cancer Types | | HTA13 | TNP: SARDANA | TNP Atlas | Technology Comparison | | HTA14 | TNP: Tissue MicroArray (TMA) | TNP Atlas | Technology Comparison | +| HTA15 | TNP: SRRS | TNP Atlas | Technology Comparison | +| HTA16 | TNP: CASI | TNP Atlas | Technology Comparison | For details on each center, please see: https://humantumoratlas.org/research-network. 
diff --git a/overview/data_levels.md b/overview/data_levels.md index 526f2fa3..df8deee3 100644 --- a/overview/data_levels.md +++ b/overview/data_levels.md @@ -2,7 +2,7 @@ order: 998 --- -# HTAN Data Levels +# HTAN Data Access Levels HTAN data is categorized into **two data access levels**: diff --git a/overview/introduction.md b/overview/introduction.md index 856ee75f..cc778cf5 100644 --- a/overview/introduction.md +++ b/overview/introduction.md @@ -6,14 +6,16 @@ order: 1000 The **Human Tumor Atlas Network (HTAN)** is a National Cancer Institute-funded Cancer Moonshot initiative focused on studying the **transitions of human cancers** as they evolve from **precancerous lesions to advanced disease**. -The network consists of **ten research centers** and a **Data Coordinating Center (DCC)**. Five of the research centers are focused on developing **pre-cancer atlases**, and the remaining five centers are focused on developing **tumor atlases**. We also have two pilot projects, one focused on pre-cancer atlases, and one focused on tumor atlases. +In the current phase of HTAN (phase 1), the network consists of **ten research centers** and a **Data Coordinating Center (DCC)**. Five of the research centers are focused on developing **pre-cancer atlases**, and the remaining five centers are focused on developing **tumor atlases**. We also have two pilot projects, one focused on pre-cancer atlases, and one focused on tumor atlases. Each research center is responsible for gathering and processing samples, and running their own experimental assays. Assays vary by center, but most centers have a strong focus on **single cell RNA-Seq** and a wide range of **multiplex imaging modalities**. All centers are required to submit their clinical, biospecimen and assay data to the HTAN DCC using a **common HTAN Data Model**. The DCC makes HTAN data available to the wider scientific community. !!! 
:zap: Important Links :zap: -Complete information regarding the HTAN network is available at: https://humantumoratlas.org/. +Complete information regarding HTAN is available at: https://humantumoratlas.org/. + +Please see [HTAN Data: A Gentle Introduction](https://cancerai.substack.com/p/htan-data-a-gentle-introduction) for an overview of HTAN Data. You can explore all open access data within the HTAN Data Portal at: https://humantumoratlas.org/explore. diff --git a/readme.md b/readme.md index ee84c94d..b75c58db 100644 --- a/readme.md +++ b/readme.md @@ -1,17 +1,20 @@ -# HTAN: The Missing Manual +# The HTAN Manual -Written by the HTAN Data Coordinating Center (DCC), with contributions from Adam Taylor, Clarisse Lau, Vésteinn Thorsson, Ino de Bruijn, David Gibbs, Ethan Cerami and Alex Lash. +Written by the HTAN Data Coordinating Center (DCC), with contributions from Adam Taylor, Clarisse Lau, Vésteinn Thorsson, Ino de Bruijn, David Gibbs, Ethan Cerami, Alex Lash and Jen Altreuter. ## About this Manual -**HTAN: The Missing Manual** provides an overview of Human Tumor Atlas Network (HTAN) data and the various modes of data access. If you have any questions regarding the manual or HTAN data, please contact us at: htan@googlegroups.com. +**The HTAN Manual** provides an overview of Human Tumor Atlas Network (HTAN) data, including the various levels of data access. If you have any questions regarding the manual or HTAN data, please contact us: [HTAN Help Desk](https://sagebionetworks.jira.com/servicedesk/customer/portal/1). The manual can be found at https://docs.humantumoratlas.org/. +If you have feedback for this manual, including broken links or incorrect information, please submit a ticket to the [HTAN Help Desk](https://sagebionetworks.jira.com/servicedesk/customer/portal/1). + ## Content Updates -| Date | Comment | -|------------|--------------------------| -| 2023-06-01 | Second version of manual | -| 2022-09-28 | First version of manual. 
| +| Date | Comment | Changes summary | +|------------|--------------------------|-----------------| +| 2024-04-01 | Third version of manual | Simplified Data Model section; added "Submitting Data" and "Additional Information" Sections | +| 2023-06-01 | Second version of manual | | +| 2022-09-28 | First version of manual | | diff --git a/retype.yml b/retype.yml index 00f2ad5b..ef9b0059 100644 --- a/retype.yml +++ b/retype.yml @@ -1,6 +1,6 @@ input: . output: .retype -url: # Add your website address here +url: jen-dfci.github.io/htan_missing_manual/ branding: title: HTAN label: Manual