pyNBS Parameters File

Tongqiu (Iris) Jia edited this page Jan 29, 2018 · 8 revisions

Parameter File Format

The parameter configuration file is a 2-column comma-separated text file where the first column is the parameter name, and the second column is the parameter value. The delimiter for this file must be a comma.

Notes about the parameter file:

  • This parameter file is used when running pyNBS through its command line script.
  • This file will be read in by the load_params function.
  • If no parameter file path is given, default parameters will be set instead (see documentation for details and default values).
  • Blank lines and lines starting with "#" will be ignored.
  • The parameter file may include as many or as few of the parameters from the pyNBS overall parameter space as needed (see all possible parameters below). For two example parameter files, compare ./OV_run_pyNBS_Hofree_params.csv with ./run_pyNBS_default_params.csv

An excerpt of the full default parameters file is given below:

################################  
#   Overall pyNBS Parameters   #  
################################  
verbose,True  
outdir,./Results/  
  
###############################  
#   Data Loading Parameters   #  
###############################  
net_filedelim,"	"  
mut_filetype,matrix  
mut_filedelim,","  
degree_preserved_shuffle,False  
node_label_shuffle,False  
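
The format rules above (two comma-separated columns, blank lines and "#" lines ignored, unspecified parameters falling back to defaults) can be sketched in a few lines of Python. This is not the code of load_params: the function name read_params_sketch, the small DEFAULTS subset (copied from values listed on this page), and the type-conversion rule are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical subset of defaults, copied from the values listed on this page.
DEFAULTS = {"verbose": True, "outdir": "./Results/",
            "mut_filetype": "matrix", "mut_filedelim": ","}

def convert(raw):
    """Best-effort conversion of a raw string to bool, int, float, or str (an assumption)."""
    if raw in ("True", "False"):
        return raw == "True"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw

def read_params_sketch(handle):
    """Read 'name,value' rows, skipping blank lines and lines starting with '#'."""
    params = dict(DEFAULTS)
    for row in csv.reader(handle):
        # Blank or whitespace-only lines have fewer than 2 fields; comments start with '#'.
        if len(row) < 2 or row[0].lstrip().startswith("#"):
            continue
        # Strip spaces only, so a quoted tab delimiter survives intact.
        params[row[0].strip()] = convert(row[1].strip(" "))
    return params

example = io.StringIO(
    "# Overall pyNBS Parameters\n"
    "verbose,False\n"
    "\n"
    'net_filedelim,"\t"\n'
    "netNMF_k,4\n"
)
params = read_params_sketch(example)
```

In this sketch, parameters absent from the file (such as outdir) keep their default values, while listed ones override them.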

Parameter Details

All parameters that can be edited by this file are described below. For additional details of each parameter, please see the linked function.

Overall pyNBS parameters

  • verbose (bool, default=True): Verbosity flag for reporting on function progress.
  • job_name (str, default='pyNBS'): Filename prefix used to tag a particular run of pyNBS.
  • outdir (str, default='./Results/'): Path to output directory. pyNBS will attempt to create the directory at the file path if it does not already exist. The default path is relative, so output goes to a Results folder under the current working directory unless otherwise defined by params_file.

Data Loading Parameters

  • net_filedelim (str, default='\t'): Delimiter used in network file between columns. This parameter is the delimiter parameter in load_network_file.
  • mut_filetype (str, default= 'matrix'): File structure of binary mutation data. There are two options: matrix or list. This parameter is the filetype parameter in load_binary_mutation_data.
  • mut_filedelim (str, default= ','): Delimiter used in binary mutation file. This parameter is the delimiter parameter in load_binary_mutation_data.
  • degree_preserved_shuffle (bool, default=False): Whether to shuffle the network edges (while preserving node degree) when loading the network. This parameter is the degree_shuffle parameter in load_network_file.
  • node_label_shuffle (bool, default=False): Whether to shuffle the network node labels (while preserving network topology) when loading the network. This parameter is the label_shuffle parameter in load_network_file.

K-Nearest Neighbors Network Construction Parameters (Used in network-regularized NMF)

  • reg_net_gamma (float, default=0.01): Constant value to add to the diagonal of molecular network graph laplacian to calculate influence matrix for regularization network construction. This parameter is the gamma parameter in network_inf_KNN_glap.
  • k_nearest_neighbors (int, default=11): Number of nearest neighbors to add to the regularization network during construction. This parameter is the kn parameter in network_inf_KNN_glap.
  • save_knn_glap (bool, default=False): Parameter to determine whether or not to save the regularization network graph laplacian. This parameter is used in command line script run_pyNBS.

Data Subsampling Parameters

  • pats_subsample_p (float, default=0.8): Proportion of rows (patients/samples) in sm_mat to subsample when performing subsampling. Range is (0.0, 1.0], and the value must be convertible to a Python float. Setting this value to 1 will simply shuffle the sm_mat data rows without subsampling them. This parameter is used in function subsample_sm_mat.
  • gene_subsample_p (float, default=0.8): Proportion of columns (mutated genes) in sm_mat to subsample when performing subsampling. Range is (0.0, 1.0], and the value must be convertible to a Python float. Setting this value to 1 will simply shuffle the sm_mat data columns without subsampling them. This parameter is used in function subsample_sm_mat.
  • min_muts (int, default=10): Minimum number of mutation counts for filtering. This parameter is used in function subsample_sm_mat.
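
How the three subsampling parameters could interact is sketched below in plain Python. This is not the code of subsample_sm_mat: the function name subsample_sketch, the dict-of-sets data layout, the seed argument, and the exact filter rule (dropping patients left with fewer than min_muts mutations) are assumptions for illustration.

```python
import random

def subsample_sketch(mutations, pats_subsample_p=0.8, gene_subsample_p=0.8,
                     min_muts=10, seed=None):
    """mutations: dict mapping patient ID -> set of mutated gene names (binary data)."""
    rng = random.Random(seed)
    patients = sorted(mutations)
    genes = sorted(set().union(*mutations.values()))
    # Draw a proportion of patients (rows) and genes (columns) without replacement.
    kept_patients = rng.sample(patients, round(len(patients) * pats_subsample_p))
    kept_genes = set(rng.sample(genes, round(len(genes) * gene_subsample_p)))
    subsampled = {p: mutations[p] & kept_genes for p in kept_patients}
    # Assumed filter: drop patients left with fewer than min_muts mutations.
    return {p: g for p, g in subsampled.items() if len(g) >= min_muts}

# Toy data: 10 patients, each with 30 mutated genes drawn from 39 genes overall.
toy = {f"P{i}": {f"G{j}" for j in range(i, i + 30)} for i in range(10)}
sub = subsample_sketch(toy, seed=0)
full = subsample_sketch(toy, 1.0, 1.0, min_muts=0, seed=1)
```

With both proportions set to 1.0 the sketch returns every patient and gene, which mirrors the note above that a value of 1 reorders the data without removing anything.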

Network Propagation Parameters

  • prop_alpha (float, default=0.7): Propagation constant to use in the propagation of mutations over molecular network. Range is 0.0-1.0 exclusive. This parameter is the parameter alpha in function network_propagation and network_kernel_propagation.
  • prop_symmetric_norm (bool, default=False): Parameter for determining whether or not to perform a symmetric degree normalization on the adjacency matrix (see normalize_network for additional details). This parameter is the parameter symmetric_norm in function network_propagation and normalize_network.
  • save_kernel (bool, default=False): Parameter for determining whether or not to save network propagation kernel. This parameter is used in command line script run_pyNBS.
  • save_prop (bool, default=False): Parameter for determining whether or not to save propagated, sub-sampled data at each intermediate step. This parameter is used in command line script run_pyNBS.
  • qnorm_data (bool, default=True): Parameter for determining whether or not to perform quantile normalization on the network-smoothed data. The default value for this is 'True'. Any other value will prevent quantile normalization. See the qnorm function for more details. This parameter is used in the **kwargs dictionary of NBS_single function.
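
For orientation, the roles of prop_alpha and the saved propagation kernel can be written out under the standard random-walk-with-restart formulation of network propagation. That pyNBS implements exactly this form is an assumption here, and the symbols F_0 (patient-by-gene binary mutation matrix), A (degree-normalized adjacency matrix), and K (propagation kernel) are introduced only for this sketch.

```latex
% Iterative form: each step spreads signal over the network and
% restarts from the original mutation profile with weight (1 - \alpha).
F_{t+1} = \alpha \, F_t A + (1 - \alpha) \, F_0

% For 0 < \alpha < 1 the iteration converges to the closed form
F = (1 - \alpha) \, F_0 \, (I - \alpha A)^{-1}

% so the kernel can be computed once and reused for every sub-sample:
K = (1 - \alpha) \, (I - \alpha A)^{-1}, \qquad F = F_0 K
```

Under this reading, a larger prop_alpha lets mutation signal diffuse further from the mutated genes, while a smaller value keeps the smoothed profile closer to the original data.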

Network-Regularized NMF Decomposition Parameters

  • netNMF_k (int, default=3): Number of components to decompose patient mutation data into during the netNMF. This is also the same as the number of clusters of patients to separate data into. This parameter is used as parameter k in mixed_netNMF and NBS_single function.
  • netNMF_gamma (int, default=200): The regularization constant used to scale the network regularizer term in netNMF. The value must be convertible to a Python int. We have found that larger positive integers for this value produce better and more robust results, and we suggest using a value between 100 and 1000. Setting this value to 0 will perform netNMF with no network regularization penalty (similar to a non-network-regularized NMF). This parameter is the parameter l in mixed_netNMF function.
  • netNMF_maxiter (int, default=250): Maximum number of update steps to perform during this function if the result does not reach convergence by a different method. This parameter is the parameter maxiter in mixed_netNMF function.
  • netNMF_eps (float, default=1e-15): Epsilon error value to adjust 0 (or very small) values during multiplicative matrix updates in netNMF. Essentially this is a parameter to define the machine precision for the netNMF step. This parameter is the parameter eps in mixed_netNMF function.
  • netNMF_err_tol (float, default=1e-4): This is the minimum error tolerance for matrix reconstruction of original data for this function to reach convergence. If the decomposition has reached a sufficiently close estimation of data, the function will return the H factor matrix from that decomposition at that time. This parameter is the parameter err_tol in mixed_netNMF function.
  • netNMF_err_delta_tol (float, default=1e-8): This is the minimum error tolerance for the L2 norm of difference in matrix reconstructions between iterations of netNMF for convergence. If the reconstruction error of the decomposition is not improving significantly, the function will return the H factor matrix from the decomposition at that time. This parameter is the parameter err_delta_tol in mixed_netNMF function.
  • save_H (bool, default=False): Parameter for determining whether or not to save individual H matrices to file. This parameter is used in command line script run_pyNBS.
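
The three stopping parameters (netNMF_maxiter, netNMF_err_tol, netNMF_err_delta_tol) describe three separate ways for the decomposition to end. The sketch below illustrates one way they could be combined. It is not the code of mixed_netNMF: the function name stop_reason, the order of the checks, and the strict comparisons are assumptions.

```python
def stop_reason(history, maxiter=250, err_tol=1e-4, err_delta_tol=1e-8):
    """Return (step, reason) for the first stopping condition that fires.

    history: iterable of (recon_err, recon_delta) pairs, one per update step.
    recon_err is the reconstruction error of the original data, and
    recon_delta is the L2 norm of the change in the reconstruction
    since the previous step.
    """
    step = 0
    for step, (recon_err, recon_delta) in enumerate(history, start=1):
        if recon_err < err_tol:
            return step, "err_tol"          # reconstruction is close enough
        if recon_delta < err_delta_tol:
            return step, "err_delta_tol"    # reconstruction stopped changing
        if step >= maxiter:
            return step, "maxiter"          # update budget used up
    return step, "history exhausted"

# Made-up error values for illustration only.
history = [(0.5, 0.1), (0.2, 0.01), (0.19, 1e-9)]
converged = stop_reason(history)
capped = stop_reason(history, maxiter=2)
```

In this toy run the error plateaus at the third step, so the err_delta_tol condition fires; lowering maxiter to 2 ends the run earlier instead.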

Consensus Clustering Parameters

  • niter (int, default=100): Number of iterations to perform sub-sampling, propagation and network-regularized NMF before consensus clustering. This parameter is used in command line script run_pyNBS.
  • hclust_linkage_method (str, default='average'): The hierarchical clustering linkage method to use. Other methods are described in the scipy.cluster.hierarchy.linkage documentation. This parameter is used in consensus_hclust_hard function.
  • hclust_linkage_metric (str, default='euclidean'): The distance metric to use when constructing the linkage map of patients to be clustered in each H matrix. Other distance measures are described in the scipy.spatial.distance.pdist documentation. This parameter is used in consensus_hclust_hard function.
  • save_cc_results (bool, default=True): Parameter for determining whether or not to save consensus clustering results. This parameter is used in command line script run_pyNBS.
  • save_cc_map (bool, default=True): Parameter for determining whether or not to save patient co-clustering map. This parameter is used in command line script run_pyNBS.
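
A patient co-clustering map of the kind saved by save_cc_map can be sketched as follows. This is an illustration rather than the code of consensus_hclust_hard: the function name co_clustering_map, the input layout, and the normalization by the number of runs in which both patients were sampled are assumptions.

```python
from collections import Counter
from itertools import combinations

def co_clustering_map(runs):
    """runs: list of dicts mapping patient ID -> cluster label for one sub-sampled run."""
    together = Counter()  # number of runs in which a pair shared a cluster
    sampled = Counter()   # number of runs in which both patients were present
    for assignment in runs:
        for a, b in combinations(sorted(assignment), 2):
            sampled[(a, b)] += 1
            if assignment[a] == assignment[b]:
                together[(a, b)] += 1
    # Fraction of co-sampled runs in which the pair landed in the same cluster.
    return {pair: together[pair] / n for pair, n in sampled.items()}

runs = [
    {"P1": 1, "P2": 1, "P3": 2},
    {"P1": 2, "P2": 2},           # P3 was not sampled in this run
    {"P1": 1, "P2": 2, "P3": 2},
]
cc_map = co_clustering_map(runs)
```

A similarity table like this is what a hierarchical clustering step (configured through hclust_linkage_method and hclust_linkage_metric) would then cut into the final patient clusters.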

Cluster Survival Analysis Parameters

  • plot_survival (bool, default=False): Parameter for determining whether or not to perform survival analysis. This parameter should only be set True when patient survival data is provided. This parameter is used in command line script run_pyNBS.
  • surv_file_delim (str, default='\t'): Delimiter used in the patient survival data file between columns. This is the parameter delimiter in cluster_KMplot function.
  • surv_lr_test (bool, default=True): Parameter for determining whether or not to perform a multivariate log-rank test on the full set (over the full length) of survival curves in the resulting KM plot. If True, this function will return the p-value of the log-rank test and add it to the title of the plot; otherwise, only the plot will be generated. This parameter is the parameter lr_test in cluster_KMplot function.
  • surv_tmax (int, default=-1): The number of days at which to cut off the KM plot display. The default (-1) shows the full length of all survival data; otherwise, surv_tmax should be a positive integer. Choosing a shorter surv_tmax will not affect the log-rank test p-values. This parameter is the parameter tmax in cluster_KMplot function.