pyNBS.pyNBS_core.subsample_sm_mat
Tongqiu (Iris) Jia edited this page Jan 27, 2018 · 4 revisions
This function performs subsampling of the rows and columns of the binary somatic mutation data. The subsampling procedure has the following steps:
Steps to construct subsampled sm_mat:
- Subsample rows (samples/tumors) and columns (network genes) of the binary somatic mutation matrix. The default is 80% of each axis. These subsampling proportions can be changed with the pats_subsample_p and gene_subsample_p parameters.
- Filter out all rows with fewer than the minimum number of mutations. The default is 10 mutations. This can be changed with the min_muts parameter.
- Reduce the binary somatic mutation data matrix to contain only columns of genes found in the network. If no network is given, this step is skipped.
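The three steps above can be sketched in a few lines of pandas. This is only an illustration written from the description on this page, not the pyNBS source: the function name subsample_sketch, the network_genes argument (a plain collection of gene names standing in for the nodes of propNet), the seed argument, the rounding of the subsample sizes, and the use of >= at the min_muts boundary are all assumptions.

```python
import numpy as np
import pandas as pd


def subsample_sketch(sm_mat, network_genes=None, pats_p=0.8, gene_p=0.8,
                     min_muts=10, seed=None):
    """Hypothetical re-creation of the documented steps (not pyNBS code)."""
    rng = np.random.default_rng(seed)

    # Step 1: draw a fraction of the rows and columns without replacement.
    # Rounding to the nearest integer is an assumption of this sketch.
    n_rows = int(round(sm_mat.shape[0] * pats_p))
    n_cols = int(round(sm_mat.shape[1] * gene_p))
    rows = rng.choice(sm_mat.index.to_numpy(), size=n_rows, replace=False)
    cols = rng.choice(sm_mat.columns.to_numpy(), size=n_cols, replace=False)
    sub = sm_mat.loc[rows, cols]

    # Step 2: drop samples with fewer than min_muts mutations.
    sub = sub[sub.sum(axis=1) >= min_muts]

    # Step 3: keep only genes found in the network; skipped if none is given.
    if network_genes is not None:
        network = set(network_genes)
        sub = sub[[gene for gene in sub.columns if gene in network]]
    return sub


# Toy cohort: 10 samples x 20 genes, every gene mutated in every sample.
demo = pd.DataFrame(
    np.ones((10, 20), dtype=int),
    index=[f"P{i}" for i in range(10)],
    columns=[f"G{i}" for i in range(20)],
)
result = subsample_sketch(demo, pats_p=0.8, gene_p=0.5, min_muts=5, seed=0)
print(result.shape)
```

Because the row filter runs after the column subsampling in this sketch, a sample can fall below min_muts simply because some of its mutated genes were not drawn, so the number of returned rows can be smaller than the requested proportion.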
subsample_sm_mat(
sm_mat, propNet=None, pats_subsample_p=0.8, gene_subsample_p=0.8, min_muts=10
)
- sm_mat (required, pandas.DataFrame): Binary somatic matrix loaded from file. This is a matrix of the binary somatic mutation profiles of the cohort to subsample. Rows are patients/samples and the columns are genes that are mutated in those tumors.
- propNet (optional, networkx.Graph, default=None): NetworkX object loaded from network file. If no network is given, then the subsampled somatic mutation data will not be restricted to any network gene space.
- pats_subsample_p (optional, float, default=0.8): Proportion of rows (patients/samples) in sm_mat to subsample. Range is (0.0, 1.0], and the value must be convertible to a Python float. Setting this value to 1 simply shuffles the rows of sm_mat without subsampling them.
- gene_subsample_p (optional, float, default=0.8): Proportion of columns (mutated genes) in sm_mat to subsample. Range is (0.0, 1.0], and the value must be convertible to a Python float. Setting this value to 1 simply shuffles the columns of sm_mat without subsampling them.
- min_muts (optional, int, default=10): Minimum number of mutations a row (sample) must have to be kept when filtering.
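The effect of the two proportion parameters, including the shuffle-only behavior at a value of 1, can be illustrated with pandas' own DataFrame.sample method. This is a stand-in chosen for the example, not necessarily what pyNBS uses internally, and the toy matrix and gene names below are made up.

```python
import pandas as pd

# Toy binary mutation matrix: 4 samples x 3 genes (hypothetical data).
sm_mat = pd.DataFrame(
    [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]],
    index=["P1", "P2", "P3", "P4"],
    columns=["TP53", "KRAS", "PTEN"],
)

# A proportion of 1 keeps every row and only permutes their order.
rows_shuffled = sm_mat.sample(frac=1.0, random_state=0)

# A proportion of 0.5 keeps half of the rows (2 of 4 here).
rows_half = sm_mat.sample(frac=0.5, random_state=0)

# The same idea applied to columns (genes) uses axis=1.
cols_shuffled = sm_mat.sample(frac=1.0, axis=1, random_state=0)

print(rows_shuffled.shape, rows_half.shape, cols_shuffled.shape)
```

Sorting the shuffled frame by its index recovers the original matrix, which confirms that a proportion of 1 changes only the order and not the content.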
- gind_sample_filt (pandas.DataFrame): The subsampled sm_mat data. The dimensions of gind_sample_filt depend on the pats_subsample_p parameter, the gene_subsample_p parameter, and whether or not a network is passed to propNet.
This function is not specifically required to perform the NBS algorithm, but it is called within the NBS_single function. When it is called there, NBS_single checks whether this function returned an empty data frame.
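An empty result is easy to produce when min_muts is high relative to the number of mutations per sample. The snippet below is a minimal sketch of how a caller might detect that case with the DataFrame.empty attribute. It does not reproduce what NBS_single does after the check, and the toy data and the status message are hypothetical.

```python
import pandas as pd

# Toy matrix in which no sample has 10 or more mutations (hypothetical data).
sm_mat = pd.DataFrame(
    [[1, 0, 1], [0, 1, 0]],
    index=["P1", "P2"],
    columns=["TP53", "KRAS", "PTEN"],
)
min_muts = 10

# Same row filter as described in the steps: keep samples with enough mutations.
filtered = sm_mat[sm_mat.sum(axis=1) >= min_muts]

# DataFrame.empty is True when any axis has length 0.
if filtered.empty:
    status = "no samples passed the min_muts filter"
else:
    status = "ok"
print(status)
```

If the check fails for real data, lowering min_muts or raising the subsampling proportions are the parameters to revisit.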