
pyNBS.pyNBS_core.subsample_sm_mat

Tongqiu (Iris) Jia edited this page Jan 27, 2018 · 4 revisions

This function subsamples the rows and columns of the binary somatic mutation data. The subsampled sm_mat is constructed in the following steps:

  1. Subsample rows (samples/tumors) and columns (network genes) of the binary somatic mutation matrix. The default is 80% of each axis. These subsampling proportions can be changed with the pats_subsample_p and gene_subsample_p parameters.
  2. Filter out all rows with fewer than the minimum number of mutations. The default is 10 mutations. This can be changed with the min_muts parameter.
  3. Reduce binary somatic mutation data matrix to only contain columns of genes found in the network. If no network is given, this step is skipped.
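
The steps above can be sketched with pandas and NumPy as follows. This is a minimal illustration under stated assumptions, not the pyNBS source: the function name subsample_sketch, the network_genes argument (a plain collection of gene names standing in for the network), and the seed argument are all hypothetical, and the steps are applied in the order they are listed above. The comparison at the filter boundary follows the wording of step 2 (rows with fewer than min_muts mutations are dropped).

```python
import numpy as np
import pandas as pd


def subsample_sketch(sm_mat, network_genes=None, pats_subsample_p=0.8,
                     gene_subsample_p=0.8, min_muts=10, seed=None):
    """Illustrative restatement of the three documented steps (hypothetical helper)."""
    rng = np.random.default_rng(seed)

    # Step 1: draw a proportion of rows and columns without replacement.
    n_rows = int(round(sm_mat.shape[0] * pats_subsample_p))
    n_cols = int(round(sm_mat.shape[1] * gene_subsample_p))
    rows = rng.choice(sm_mat.index.to_numpy(), size=n_rows, replace=False)
    cols = rng.choice(sm_mat.columns.to_numpy(), size=n_cols, replace=False)
    sub = sm_mat.loc[rows, cols]

    # Step 2: drop samples with fewer than min_muts mutations.
    sub = sub[sub.sum(axis=1) >= min_muts]

    # Step 3: keep only genes found in the network; skipped if none is given.
    if network_genes is not None:
        network_set = set(network_genes)
        sub = sub[[g for g in sub.columns if g in network_set]]
    return sub


# Toy cohort: 10 samples x 20 genes, every gene mutated in every sample.
toy = pd.DataFrame(np.ones((10, 20), dtype=int),
                   index=[f"P{i}" for i in range(10)],
                   columns=[f"G{j}" for j in range(20)])
result = subsample_sketch(toy, seed=0)
print(result.shape)
```

With the default proportions, the toy cohort keeps 8 of 10 rows and 16 of 20 columns before filtering, so the result can never be larger than that.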

Function Call:

subsample_sm_mat(sm_mat, propNet=None, pats_subsample_p=0.8, gene_subsample_p=0.8, min_muts=10)

Parameters:

  • sm_mat (required, pandas.DataFrame): Binary somatic matrix loaded from file. This is a matrix of the binary somatic mutation profiles of the cohort to subsample. Rows are patients/samples and the columns are genes that are mutated in those tumors.
  • propNet (optional, networkx.Graph, default=None): NetworkX object loaded from network file. If no network is given, the subsampled somatic mutation data will not be restricted to any network gene space.
  • pats_subsample_p (optional, float, default=0.8): Proportion of rows (patients/samples) in sm_mat to subsample. Range is (0.0, 1.0] and the value must be convertible to a Python float. Setting this value to 1 will simply shuffle the rows of sm_mat without subsampling them.
  • gene_subsample_p (optional, float, default=0.8): Proportion of columns (mutated genes) in sm_mat to subsample. Range is (0.0, 1.0] and the value must be convertible to a Python float. Setting this value to 1 will simply shuffle the columns of sm_mat without subsampling them.
  • min_muts (optional, int, default=10): Minimum number of mutations a row (patient/sample) must have to be kept. Rows with fewer mutations are filtered out.

Returns:

  • gind_sample_filt (pandas.DataFrame): The subsampled sm_mat data. The dimensions of gind_sample_filt depend on the pats_subsample_p parameter, the gene_subsample_p parameter, and whether or not a network is passed to propNet.

Additional notes about this function:

This function is not strictly required to perform the NBS algorithm, but it is called within the NBS_single function. When it is called there, NBS_single checks whether this function returns an empty data frame.
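
As a rough illustration of how an empty data frame can arise, the sketch below applies a minimum-mutation filter to a toy matrix in which no sample reaches the threshold, then tests the result with the pandas DataFrame.empty attribute. The toy data and the variable names are assumptions made for this example; it does not reproduce the check inside NBS_single.

```python
import pandas as pd

# Toy binary matrix: 3 samples x 4 genes, at most 2 mutations per sample.
toy = pd.DataFrame([[1, 0, 1, 0],
                    [0, 1, 0, 0],
                    [0, 0, 0, 1]],
                   index=["P1", "P2", "P3"],
                   columns=["A", "B", "C", "D"])

min_muts = 10  # same value as the documented default

# No row has 10 or more mutations, so every sample is filtered out.
filtered = toy[toy.sum(axis=1) >= min_muts]

if filtered.empty:
    print("No samples passed the mutation filter")
```

Lowering min_muts or raising the subsampling proportions are the documented parameters that affect whether any rows survive.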
