UCSC Xena is an online tool that allows users to analyze and visualize genomics data[1]. The website supports various data types from multiple platforms such as the Genomics Data Commons (GDC)[2]. Data types supported by the GDC Xena hub include RNA-seq gene expression, DNA methylation, clinical data, Gene-level mutation, survival, and more. The data from the GDC was last imported into UCSC Xena 4 years ago using the Xena-GDC-ETL code[7]. This means that Xena users lack access to the newer and updated GDC data. The quantity of data that needs to be updated is significant, which makes it impossible to completely test the data manually. This project aims to create a set of automatic testing scripts to efficiently validate the accuracy of each data type imported from the GDC through the Xena-GDC-ETL code[7], both for data from existing projects on the Xena browser such as The Cancer Genome Atlas (TCGA)[5], as well as new projects like CTSP-DLBCL1[6] to the Xena platform. These scripts will complement some manual spot-checking as well as some manual tests which can not be automated.
The NCI's Genomic Data Commons (GDC) provides the cancer research community with a unified repository and cancer knowledge base that enables data sharing across cancer genomic studies in support of precision medicine. The GDC provides public access to data from important national and international cancer research projects such as TCGA, Therapeutically Applicable Research to Generate Effective Treatments (TARGET) [3], and Genomics Evidence Neoplasia Information Exchange (GENIE)[4]. This data can be retrieved and downloaded using the GDC API’s various endpoints, such as “/cases”, “/files”, “/projects”, etc. These endpoints are used by the Xena-GDC-ETL to import data, transform the data into files that are compatible with Xena, and transfer it to the Xena hub.
A multitude of researchers use data from the GDC database on the Xena browser, making the accuracy of this data a major priority. The data is too large to be tested entirely by hand, hence it is crucial to automatically test the data from the GDC. This project consists of:
- Understanding the GDC API’s search, retrieval, and download.
- Develop multiple automatic tests for GDC’s RNA-seq data
- Develop an automatic test for GDC survival data
- Develop an automatic test for GDC clinical data
The first project chosen to import from the GDC using Xena-GDC-ETL was CTSP-DLBCL1. This is due to the project’s small case size of 45 and the fact that it was limited in its types of genomic data to only RNA-seq data.
I wrote a script that compared the four types of gene expression data (counts, FPKM, FPKM-UQ, and TPM) in the GDC with the .tsv file generated by the Xena-GDC-ETL code. First, my script independently downloads the RNA-seq data using the GDC API. Then it reformats both files into a pandas data frame[8] to compare every value on a cell-by-cell basis. Any inconsistencies in the data are reported to the command line once the script is finished. Each cell’s value, gene ID, and sample ID are saved into a list that is printed out at the end of the program.
In addition to comparing the values cell-by-cell, I also created a test to find the correlation between the different data types of RNA-seq data and ensure that the values are close to one. I wrote a script to run a Pearson correlation[8] between all four RNA-seq data types, resulting in six comparisons. This is done on a sample-by-sample basis, meaning the correlation test only checks one sample, but every gene. However, counts are the number of sequence reads of a gene. Because gene sizes may vary significantly, this means that larger genes will generally have more counts. This is different from FPKM, FPKM-uq, and TPM, which are normalized to one million read counts. This means that these three data types measure the density of read counts, while counts are the total number of read counts. Due to this difference, any correlation test with counts is done on a gene-by-gene basis, meaning that only one gene is measured for every sample. This is done to prevent any variation between genes. All of the values are plotted onto a scatter plot with the axes being data types.
For the clinical data I was not able to simply download the clinical data from the GDC API to compare to the .tsv file generated by the Xena-GDC-ETL code because the Xena-GDC-ETL code deletes fields that have no data. This is common for the clinical data from the GDC as each project reports different clinical data depending on the cancer type, project stage, and which fields the project was able to collect as part of the project. In light of this, the script I wrote first reads the fields from the .tsv file generated by the Xena-GDC-ETL. These fields are then used to limit the output from the calls to the GDC API clinical endpoint to only these fields. The results from the GDC API are loaded into a pandas data frame. Then, similar to the RNA-seq automated testing, the .tsv file is converted to a pandas data frame to compare each value. If there are multiple samples per case, the sample IDs are saved and run through the program a second time.
The GDC API has a specific endpoint for survival data called: “/analysis/survival”, which the Xena-GDC-ETL code uses this endpoint to retrieve survival data. This is a separate endpoint from the "cases" clinical data endpoint. I created a script to retrieve survival data of a specified project using the “/analysis/survival” endpoint, convert it into a data frame, and compare it with the Xena-GDC-ETL generated data frame, similar to the testing script for the RNA-seq data.
Further, I created an automated test to verify the internal consistency within the GDC between the survival and "cases" clinical endpoints. It first retrieves the relevant survival fields for a specific project using the GDC API’s “/cases” endpoint. This endpoint returns data on a case-by-case basis, which is then formatted into a data frame. From here it takes the 'vital status' as whether the individual is alive or dead. It then takes the greatest value of the following fields: “days to death”, “days to last follow up”, “days to last known disease status”, “days to recurrence”, and more, as these are all related to the calculation of time to event, whether that is being alive or dead. Finally, the .tsv file output by the Xena-GDC-ETL code is converted to a data frame as well, so that every cell can be compared. While the first test was conducted on the CTSP-DLBCL1 project, we also conducted multiple tests on TCGA-BRCA and TCGA-UCEC.
Each automated testing script was run on the CTSP-DLBCL1 project. When testing RNA-seq data of CTSP-DLBCL1 imported by the Xena-GDC-ETL, all data matched, meaning that the import done by the Xena-GDC-ETL code was accurate.
There were several inconsistencies between the survival and clinical data. When using the “/cases” endpoint, several cases that were alive but had follow-up data, were missing from the GDC survival endpoint. These inconsistencies were in the CTSP-DLBCL1 project (an example is CTSP-AD1S), but not in the TCGA-BRCA or TCGA-UCEC projects. Initial analysis of the clinical and survival endpoints leads me to believe that this is due to the fact that the CTSP-DLBCL1 projects submitted follow-up data to the field: ‘days to follow up’, whereas the TCGA project submitted to the field ‘days to last follow up’. Further analysis and follow up are needed to verify this. Despite this internal consistency within the GDC, the Xena-GDC-ETL import worked as expected where, for the project CTSP-DLBCL1, the “/analysis/survival” endpoint leads to the same results as the Xena-GDC-ETL import.
Another example of different endpoints producing different results can be found in the results of the clinical data automated testing. The automated testing script uses the “/cases” endpoint, while the Xena-GDC-ETL code uses the “/projects” endpoint. Due to this difference, only one diagnosis of a case is retrieved by Xena-GDC-ETL, compared to all diagnoses by the automated testing script. An example of this is the case, CTSP-AD1S, which has 2 diagnoses, but only one is found in the Xena-GDC-ETL.
Correlation: 0.9999999999999867
Correlation: 0.999999999999983
Correlation: 0.9880884397613635
Correlation: 0.9999999999999917
Correlation: 0.9545050830065851
Correlation: 0.9587627363186685
Highly significant positive correlation between various RNA-seq data types was generally expected. Correlation values between TPM, FPKM, and FPKM-UQ were all nearly 1, which is expected as they’re normalized per million reads. When comparing counts, which are not normalized, to other data types, the correlation values were lower, between 0.94 and 0.98. This is still close to 1.
Command line:
[This file name] [Xena-GDC-ETL imported .tsv file name] [data type] [debug mode]
Valid data types: [“fpkm”, “fpkm-uq”, “tpm”, “star_counts”]
Valid debug modes: [“True”, “False”]
Example:
python3 XenaGeneExpressionMatrixValidation.py /Users/Downloads/CTSP-DLBCL1.star_tpm.tsv tpm False
Example results:
Command line: [This file name] [Xena-GDC-ETL imported .tsv file name(data type 1)] [file name(data type 2)] [file name(data type 3)] [file name(data type 4)]
Example:
python3 RNAseqPCC.py
/Users/Downloads//CTSP-DLBCL1.star_tpm.tsv /Users/Downloads/CTSP-DLBCL1.star_fpkm-uq.tsv /Users/Downloads/CTSP-DLBCL1.star_fpkm.tsv /Users/Downloads/CTSP-DLBCL1.star_counts.tsv
Example results:
Command line:
[This file name] [Xena-GDC-ETL imported .tsv file] [project id]
Example:
python3 XenaSurvivalMatrixValidation.py /Users/Downloads/CTSP-DLBCL1.survival.tsv CTSP-DLBCL1
Example results:
Command line:
[This file name] [Xena-GDC-ETL imported .tsv file] [project id]
Example:
python3 XenaSurvivalAnalysisEndptValidation.py /Users/Downloads/CTSP-DLBCL1.survival.tsv CTSP-DLBCL1
Example results:
Command line:
[This file name] [Xena-GDC-ETL imported .tsv file]
Example:
python3 XenaClinicalDataValidation.py /Users/Downloads/CTSP-DLBCL1.clinical.tsv
Example results:
[1]Goldman, M.J., Craft, B., Hastie, M. et al. Visualizing and interpreting cancer genomics data via the Xena platform. Nat Biotechnol (2020). https://doi.org/10.1038/s41587-020-0546-8
[2]Heath AP, Ferretti V, Agrawal S, An M, Angelakos JC, Arya R, Bajari R, Baqar B, Barnowski JHB, Burt J, Catton A, Chan BF, Chu F, Cullion K, Davidsen T, Do PM, Dompierre C, Ferguson ML, Fitzsimons MS, Ford M, Fukuma M, Gaheen S, Ganji GL, Garcia TI, George SS, Gerhard DS, Gerthoffert F, Gomez F, Han K, Hernandez KM, Issac B, Jackson R, Jensen MA, Joshi S, Kadam A, Khurana A, Kim KMJ, Kraft VE, Li S, Lichtenberg TM, Lodato J, Lolla L, Martinov P, Mazzone JA, Miller DP, Miller I, Miller JS, Miyauchi K, Murphy MW, Nullet T, Ogwara RO, Ortuño FM, Pedrosa J, Pham PL, Popov MY, Porter JJ, Powell R, Rademacher K, Reid CP, Rich S, Rogel B, Sahni H, Savage JH, Schmitt KA, Simmons TJ, Sislow J, Spring J, Stein L, Sullivan S, Tang Y, Thiagarajan M, Troyer HD, Wang C, Wang Z, West BL, Wilmer A, Wilson S, Wu K, Wysocki WP, Xiang L, Yamada JT, Yang L, Yu C, Yung CK, Zenklusen JC, Zhang J, Zhang Z, Zhao Y, Zubair A, Staudt LM, Grossman RL. The NCI Genomic Data Commons. Nat Genet. 2021 Mar;53(3):257-262. doi: 10.1038/s41588-021-00791-5. Erratum in: Nat Genet. 2021 Jun;53(6):935. PMID: 33619384.
[3]Ma X, Edmonson M, Yergeau D, Muzny DM, Hampton OA, Rusch M, Song G, Easton J, Harvey RC, Wheeler DA, Ma J, Doddapaneni H, Vadodaria B, Wu G, Nagahawatte P, Carroll WL, Chen IM, Gastier-Foster JM, Relling MV, Smith MA, Devidas M, Guidry Auvil JM, Downing JR, Loh ML, Willman CL, Gerhard DS, Mullighan CG, Hunger SP, Zhang J. Rise and fall of subclones from diagnosis to relapse in pediatric B-acute lymphoblastic leukaemia. Nat Commun. 2015 Mar 19;6:6604. doi: 10.1038/ncomms7604. PMID: 25790293; PMCID: PMC4377644.
[4]Micheel CM, Sweeney SM, LeNoue-Newton ML, André F, Bedard PL, Guinney J, Meijer GA, Rollins BJ, Sawyers CL, Schultz N, Shaw KRM, Velculescu VE, Levy MA; AACR Project GENIE Consortium. American Association for Cancer Research Project Genomics Evidence Neoplasia Information Exchange: From Inception to First Data Release and Beyond-Lessons Learned and Member Institutions' Perspectives. JCO Clin Cancer Inform. 2018 Dec;2:1-14. doi: 10.1200/CCI.17.00083. PMID: 30652542; PMCID: PMC6873906.
[5]National Cancer Institute, 2023 The Cancer Genome Atlas Program (TCGA). www.cancer.gov/ccg/research/genome-sequencing/tcga Accessed 19 June 2023.
[6]Schmitz R, Wright GW, Huang DW, Johnson CA, Phelan JD, Wang JQ, Roulland S, Kasbekar M, Young RM, Shaffer AL, Hodson DJ, Xiao W, Yu X, Yang Y, Zhao H, Xu W, Liu X, Zhou B, Du W, Chan WC, Jaffe ES, Gascoyne RD, Connors JM, Campo E, Lopez-Guillermo A, Rosenwald A, Ott G, Delabie J, Rimsza LM, Tay Kuang Wei K, Zelenetz AD, Leonard JP, Bartlett NL, Tran B, Shetty J, Zhao Y, Soppet DR, Pittaluga S, Wilson WH, Staudt LM. Genetics and Pathogenesis of Diffuse Large B-Cell Lymphoma. N Engl J Med. 2018 Apr 12;378(15):1396-1407. doi: 10.1056/NEJMoa1801445. PMID: 29641966; PMCID: PMC6010183.
[7] Xena-GDC-ETL, https://github.com/ucscXena/xena-GDC-ETL Accessed 11 August 2023.
[8] pandas - Python Data Analysis Library. (n.d.), https://pandas.pydata.org/ Accessed 14 August 2023.