Somatic Mutation Data File Format

Justin Huang edited this page Jan 27, 2018 · 7 revisions

The binary somatic mutation data file is loaded for use by the pyNBS algorithm with the load_binary_mutation_data function. The file can be provided in one of two formats:

List Format

The default format for the binary somatic mutation data file is the list format. This file format is a 2-column csv or tsv list in which the 1st column is a sample/patient and the 2nd column is a gene mutated in that sample/patient. There are no headers in this file format. Loading data from the list format is typically faster than loading data from the matrix format. The following text is the list representation of the matrix excerpt shown in the Matrix Format section:

TCGA-04-1638	A2M
TCGA-23-1029	A1CF
TCGA-23-2647	A2BP1
TCGA-24-1847	A2M
TCGA-42-2589	A1CF
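
To make the relationship between the two formats concrete, the sketch below reads a list-format file and expands it into a binary sample-by-gene matrix. It is a minimal illustration using only the Python standard library, not the pyNBS loader itself: the helper names read_mutation_list and to_binary_matrix and the in-memory LIST_TEXT stand-in are assumptions made for this example.

```python
import csv
import io

# Hypothetical in-memory stand-in for a tab-separated list-format file.
LIST_TEXT = (
    "TCGA-04-1638\tA2M\n"
    "TCGA-23-1029\tA1CF\n"
    "TCGA-23-2647\tA2BP1\n"
    "TCGA-24-1847\tA2M\n"
    "TCGA-42-2589\tA1CF\n"
)


def read_mutation_list(handle, delimiter="\t"):
    """Return {sample: set of mutated genes} from a header-less 2-column list."""
    mutations = {}
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        sample, gene = row[0].strip(), row[1].strip()
        mutations.setdefault(sample, set()).add(gene)
    return mutations


def to_binary_matrix(mutations):
    """Expand the mapping into (samples, genes, rows of 0/1 values)."""
    samples = sorted(mutations)
    genes = sorted(set().union(*mutations.values()))
    rows = [[1 if g in mutations[s] else 0 for g in genes] for s in samples]
    return samples, genes, rows


mutations = read_mutation_list(io.StringIO(LIST_TEXT))
samples, genes, rows = to_binary_matrix(mutations)

# Print the matrix with genes as columns and samples as rows.
print("\t".join([""] + genes))
for sample, row in zip(samples, rows):
    print("\t".join([sample] + [str(v) for v in row]))
```

Storing each sample's genes in a set means a (sample, gene) pair that appears more than once in the list still yields a single 1 in the matrix.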

Matrix Format

The matrix format is a binary csv or tsv matrix in which rows represent samples/patients and columns represent genes. A value of 1 indicates that the gene is mutated in the sample/patient, and 0 indicates that it is not. The following table is a small excerpt of a matrix somatic mutation data file:

              A1CF  A2BP1  A2M
TCGA-04-1638  0     0      1
TCGA-23-1029  1     0      0
TCGA-23-2647  0     1      0
TCGA-24-1847  0     0      1
TCGA-42-2589  1     0      0
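
Going the other way, a matrix-format file can be flattened into the (sample, gene) pairs of the list format. The sketch below uses only the Python standard library and is not part of pyNBS: the helper name matrix_to_pairs is hypothetical, and the sample data assumes a tab-separated file whose header row starts with an empty cell above the sample column.

```python
import csv
import io

# Hypothetical in-memory stand-in for a tab-separated matrix-format file.
# Assumption: the first header cell (above the sample IDs) is empty.
MATRIX_TEXT = (
    "\tA1CF\tA2BP1\tA2M\n"
    "TCGA-04-1638\t0\t0\t1\n"
    "TCGA-23-1029\t1\t0\t0\n"
    "TCGA-23-2647\t0\t1\t0\n"
    "TCGA-24-1847\t0\t0\t1\n"
    "TCGA-42-2589\t1\t0\t0\n"
)


def matrix_to_pairs(handle, delimiter="\t"):
    """Return a list of (sample, gene) pairs for every non-zero matrix cell."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    genes = header[1:]  # the first header cell labels the sample column
    pairs = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        sample, values = row[0], row[1:]
        for gene, value in zip(genes, values):
            if float(value) != 0:
                pairs.append((sample, gene))
    return pairs


pairs = matrix_to_pairs(io.StringIO(MATRIX_TEXT))

# Emit the header-less, tab-separated list representation.
for sample, gene in pairs:
    print(f"{sample}\t{gene}")
```

Because only non-zero cells are written out, the list representation stays small for sparse mutation data, which is consistent with it being the quicker format to load.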

Note

  • A TCGA MAF file downloaded from the Broad Institute's Firehose can be converted with the process_TCGA_MAF function into a binary somatic mutation file that is usable by the pyNBS package.
  • All somatic mutation data used in our examples can be found here.