pyNBS.pyNBS_single.NBS_single

Justin Huang edited this page Feb 3, 2018 · 10 revisions

This is the primary function that wraps a single iteration of the core steps of the NBS algorithm. For each call of this function, pyNBS will perform the following steps:

  1. Subsample binary somatic mutation data (see the subsample_sm_mat function for more details).
  2. Propagate binary somatic mutation data over the given molecular network (see the network_propagation or network_kernel_propagation function for more details).
  3. Quantile normalize the network-smoothed data (see the qnorm function for more details).
  4. Perform network-regularized non-negative matrix factorization (netNMF) (see the netNMF function for more details).

However, this function is intentionally flexible: the user can keep it from executing specific steps above, for example by omitting propNet to skip network propagation or by setting kwargs['qnorm_data'] to skip quantile normalization.
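
As a rough illustration of that flexibility, the sketch below lists which steps one call would run for a given set of arguments. It is not pyNBS source code: `planned_steps` is a hypothetical helper, and it assumes (beyond what this page states) that a propNet_kernel passed without a propNet is ignored, and that quantile normalization and netNMF still run when no propagation takes place.

```python
def planned_steps(propNet=None, propNet_kernel=None, **kwargs):
    """Return the names of the steps one NBS iteration would run.

    Hypothetical helper that mirrors this page's description only.
    """
    steps = ["subsample_sm_mat"]

    # Propagation only happens when a network is supplied.
    # Assumption: a kernel given without a network is ignored.
    if propNet is not None:
        if propNet_kernel is None:
            steps.append("network_propagation")
        else:
            steps.append("network_kernel_propagation")

    # Quantile normalization runs for the default, the string 'True',
    # or the boolean True; any other value skips it.
    qnorm_data = kwargs.get("qnorm_data", "True")
    if qnorm_data == "True" or qnorm_data is True:
        steps.append("qnorm")

    # netNMF is treated here as always executed (assumption).
    steps.append("netNMF")
    return steps


# object() stands in for a real network or kernel in this sketch.
no_network = planned_steps()
with_kernel = planned_steps(propNet=object(), propNet_kernel=object(),
                            qnorm_data="False")
```

Here `no_network` contains no propagation step, while `with_kernel` uses the kernel-based propagation and skips quantile normalization.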


Function Call:

NBS_single(sm_mat, regNet_glap, propNet=None, propNet_kernel=None, k=3, verbose=False, **kwargs)

Parameters:

  • sm_mat (required, pandas.DataFrame): Binary somatic matrix loaded from file. This is a matrix of the binary somatic mutation profiles of the cohort to perform pyNBS on. Rows are patients/samples and the columns are genes that are mutated in those tumors.

  • regNet_glap (required, pandas.DataFrame): Graph laplacian (gene-by-gene) of KNN influence network constructed by the network_inf_KNN_glap function. This is the regularization matrix for the netNMF step.

  • propNet (optional, networkx.Graph, default=None): NetworkX graph object loaded from a network file. If no network is given, the subsampled somatic mutation data will not be restricted to any network gene space, and the sm_mat data will not be propagated over any network.

  • propNet_kernel (optional, pandas.DataFrame, default=None): The gene-by-gene output of the network_propagation function when it is used to construct the "network propagation kernel" described in the documentation. If no propNet_kernel is given, this function propagates the sm_mat data with the native network_propagation function. If a pre-computed propNet_kernel is passed, this function calls network_kernel_propagation instead, which can significantly speed up the network propagation step, especially when this function is called multiple times.

  • k (optional, int, default=3): Number of components to decompose patient mutation data into during the netNMF step. This is also the same as the number of clusters of patients to separate data into.

  • **kwargs (optional, dict, default=None): Dictionary of parameters that control the functions called by this function. Many of these parameters are the same as those described on the parameters file page and in the documentation of the function each kwarg is used in. All keys and values in the dictionary are expected to be strings, although values may be cast to other data types; see the descriptions below for specific kwarg details.

    • kwargs['gene_subsample_p']: Proportion of columns (mutated genes) in sm_mat to subsample when performing subsample_sm_mat. Range is (0.0-1.0] and the value must be able to be converted to a Python float. The default value of this parameter is 0.8. Setting this value to 1 will simply shuffle the sm_mat data columns without subsampling them. See subsample_sm_mat for more details.
    • kwargs['iteration_label']: A string containing a file indicator for the files saved in kwargs['outdir'], used to keep track of which pyNBS iteration each saved file corresponds to. If it is not given, all files will be saved with the same base name (see kwargs['job_name']).
    • kwargs['job_name']: A string containing a file prefix for the H matrix and propagated profiles (if specified by kwargs['save_prop']) saved in kwargs['outdir']. If it is not given, the base file name defaults to H.csv (or prop.csv for propagated profiles).
    • kwargs['min_muts']: Minimum number of mutations required in each row of the subsampled sm_mat produced by subsample_sm_mat for that row to be included in the final subsampled data matrix. The value cannot be less than 0. There is currently no upper limit, but if the number is too large, subsample_sm_mat may return an empty result. The value must be able to be converted to a Python int and the default value of this parameter is 10. See subsample_sm_mat for more details.
    • kwargs['netNMF_err_delta_tol']: This is the minimum error tolerance for the L2 norm of the difference in matrix reconstructions between iterations of netNMF for convergence. If the reconstruction error of the decomposition is not improving significantly, the function will return the H matrix from the decomposition at that time. The value must be able to be converted to a Python float and the default value of this parameter is 1e-8. See netNMF for more details.
    • kwargs['netNMF_err_tol']: This is the minimum error tolerance for matrix reconstruction of the original data for netNMF to reach convergence. If the decomposition has reached a sufficiently close estimation of the original data passed into the netNMF function, the function will return the H matrix from that decomposition at that time. The value must be able to be converted to a Python float and the default value of this parameter is 1e-4. See netNMF for more details.
    • kwargs['netNMF_eps']: Epsilon error value used to adjust 0 (or very small) values during multiplicative matrix updates in netNMF. Essentially this is a parameter to define the machine precision for the netNMF step. The value must be able to be converted to a Python float and the default value of this parameter is 1e-15. Changing this parameter may lead to unexpected behavior, but it is accessible for algorithm testing purposes.
    • kwargs['netNMF_gamma']: This is the regularization constant that scales the network regularizer (knnGlap) term in netNMF. The value must be able to be converted to a Python int and the default value of this parameter is 200. We have found that larger positive integers for this value produce better and more robust results. We suggest using a value between 100 and 1000 for this parameter. Setting this value to 0 will perform netNMF with no network regularization penalty (similar to a non-network-regularized NMF). See netNMF for more details.
    • kwargs['netNMF_maxiter']: Maximum number of multiplicative updates to perform within the netNMF if the result does not reach convergence by a different method. See netNMF for more details.
    • kwargs['netNMF_verbose']: Verbosity flag for determining whether or not to have the netNMF function report intermediate progress at each iteration. The default value for this is 'False'. Passing 'True' for this parameter will turn on stdout reporting of the netNMF function.
    • kwargs['outdir']: A string containing the path of the directory in which to save the resulting H matrix of this function. If this parameter is given within **kwargs, the function will automatically write its output H matrix as a .csv to this location. It also defines the save location of the propagated profiles if saving them is specified by kwargs['save_prop'].
    • kwargs['pats_subsample_p']: Proportion of rows (patients/samples) in sm_mat to subsample when performing subsample_sm_mat. Range is (0.0-1.0] and the value must be able to be converted to a Python float. The default value of this parameter is 0.8. Setting this value to 1 will simply shuffle the sm_mat data rows without subsampling them. See subsample_sm_mat for more details.
    • kwargs['prop_alpha']: The propagation constant to use when propagating mutations over the molecular network. This parameter is only used if a propNet is given and no propNet_kernel is given to the function. The range is 0.0 to 1.0, exclusive, and the value must be able to be converted to a Python float. The default value of this parameter is 0.7. See network_propagation for more details.
    • kwargs['prop_symmetric_norm']: Parameter for determining whether or not to perform a symmetric degree normalization on the adjacency matrix during propagation. This parameter is only used if a propNet is given and no propNet_kernel is given to the function. The default value for this is 'False'. Passing 'True' for this parameter will perform symmetric normalization on the molecular network for propagation. This value can also be a boolean (True or False). See normalize_network for additional details.
    • kwargs['qnorm_data']: Parameter for determining whether or not to perform quantile normalization on the network-smoothed data. The default value for this is the string 'True'. Quantile normalization is performed only if this value is the string 'True' or the boolean True; any other value prevents it. See the qnorm function for more details.
    • kwargs['save_prop']: Parameter for determining whether or not to save the propagation result constructed by this function. This parameter is only used if a propNet is given (so that this function calls a network propagation function). **kwargs is passed to the appropriate network propagation function, which receives it as its **save_args parameters. The **save_args parameters are the same kwargs['outdir'], kwargs['job_name'], and kwargs['iteration_label'] described above. This value can be either a string or a boolean denoting True or False.
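
Because every kwarg is expected as a string that is later cast, it can help to assemble and sanity-check the dictionary before calling NBS_single. The following is a minimal sketch under that assumption: `build_nbs_kwargs` is a hypothetical helper (not part of pyNBS), its defaults are copied from the descriptions above, and it only checks the ranges and conversions stated on this page.

```python
def build_nbs_kwargs(**overrides):
    """Return a string-valued kwargs dict for NBS_single (hypothetical helper)."""
    # Defaults as documented on this page, stored as strings.
    params = {
        "pats_subsample_p": "0.8",
        "gene_subsample_p": "0.8",
        "min_muts": "10",
        "prop_alpha": "0.7",
        "prop_symmetric_norm": "False",
        "qnorm_data": "True",
        "netNMF_gamma": "200",
        "netNMF_err_tol": "1e-4",
        "netNMF_err_delta_tol": "1e-8",
        "netNMF_eps": "1e-15",
        "netNMF_verbose": "False",
    }
    # Overrides (e.g. outdir, job_name) are stored as strings too.
    params.update({key: str(value) for key, value in overrides.items()})

    # Subsampling proportions must lie in (0.0, 1.0].
    for key in ("pats_subsample_p", "gene_subsample_p"):
        if not 0.0 < float(params[key]) <= 1.0:
            raise ValueError("%s must be in (0.0, 1.0]" % key)

    # The propagation constant must lie strictly between 0.0 and 1.0.
    if not 0.0 < float(params["prop_alpha"]) < 1.0:
        raise ValueError("prop_alpha must be between 0.0 and 1.0, exclusive")

    # min_muts and netNMF_gamma must convert to int; min_muts cannot be negative.
    if int(params["min_muts"]) < 0:
        raise ValueError("min_muts cannot be less than 0")
    int(params["netNMF_gamma"])

    # The netNMF tolerances must convert to float.
    for key in ("netNMF_err_tol", "netNMF_err_delta_tol", "netNMF_eps"):
        float(params[key])
    return params


nbs_kwargs = build_nbs_kwargs(netNMF_gamma=500, outdir="./results/",
                              job_name="example", iteration_label="1")
# With sm_mat, regNet_glap and a network loaded elsewhere, the call would be:
# NBS_single(sm_mat, regNet_glap, propNet=network, k=3, **nbs_kwargs)
```

The values `./results/`, `example`, and `1` are placeholders chosen for illustration only.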

Returns:

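  • H_df (pandas.DataFrame): The H matrix from the netNMF decomposition of the network-smoothed somatic mutation profiles. This is the same matrix that is written to kwargs['outdir'] when that parameter is given.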