This code supplements the publication by Esser-Skala and Fortelny (2023).
(Not all of these folders are included in the git repository.)
data
: output files generated by P-NET and DTox (available from https://doi.org/10.5281/zenodo.7760561)
doc
: project documentation
docker
: files for creating Docker containers with P-NET or DTox installed
literature
: relevant publications
plots
: generated plots
pnet_data
: P-NET data files
renv
: R environment data
scripts
: bash and R scripts
Download the provided Docker containers from the GitHub Container registry:
docker pull ghcr.io/csbg/pnet-container:1.0.0
docker pull ghcr.io/csbg/dtox-container:1.0.0
Alternatively, pull these containers with Apptainer/Singularity:
singularity pull docker://ghcr.io/csbg/pnet-container:1.0.0
singularity pull docker://ghcr.io/csbg/dtox-container:1.0.0
When using the latter container format, replace all calls to run_[pnet/dtox]_docker.sh with run_[pnet/dtox]_singularity.sh.
In order to run the R scripts, you will need one of the following:
- an installation of R 4.3.1; restore required packages from renv.lock via
  Rscript -e "renv::restore()"
- the Docker container available from the GitHub Container registry:
  docker pull ghcr.io/csbg/r_pnet_robustness:1.0.0
  Replace calls to Rscript below by scripts/run_rscript_docker.sh.
- the Apptainer/Singularity container:
  singularity pull docker://ghcr.io/csbg/r_pnet_robustness:1.0.0
  Replace calls to Rscript below by scripts/run_rscript_singularity.sh.
Download the MSK-IMPACT 2017 dataset:
wget https://cbioportal-datahub.s3.amazonaws.com/msk_impact_2017.tar.gz
tar xzf msk_impact_2017.tar.gz -C pnet_data
Generally, each experiment comprises the following steps:
- Load P-NET input data via load_data_[dataset].R.
- Optionally, modify input data via modify_data_[technique].R.
- Run P-NET via Docker using the provided bash script run_pnet_docker.sh. This script has three arguments:
  -e experiment
  : experiment name, required
  -l [n]
  : lower seed, optional (default: -1, which uses the original seeds)
  -u [n]
  : upper seed, optional (default: 49)
Within each experiment, results from each run are saved in a subfolder indicating the two random seeds used (e.g., data/pnet_original/0_0).
utils.R is required by all data preparation scripts.
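For downstream analyses it can help to enumerate the run subfolders of one experiment. The following is a minimal sketch, assuming that every run folder is named like the example `0_0`, i.e. two integer seeds joined by an underscore. The helper names `parse_run_folder` and `list_runs` are hypothetical and not part of the repository, and accepting negative seeds in folder names is a guess.

```python
import re
from pathlib import Path

# Assumed folder-name pattern: two integer seeds joined by "_", e.g. "0_0".
# Allowing a leading minus sign is a guess, not documented behavior.
RUN_FOLDER = re.compile(r"^(-?\d+)_(-?\d+)$")


def parse_run_folder(name):
    """Return the two seeds encoded in a run folder name, or None."""
    match = RUN_FOLDER.match(name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def list_runs(experiment_dir):
    """Map each seed pair to its run folder inside one experiment folder."""
    runs = {}
    for child in sorted(Path(experiment_dir).iterdir()):
        seeds = parse_run_folder(child.name) if child.is_dir() else None
        if seeds is not None:
            runs[seeds] = child
    return runs
```

For instance, `list_runs("data/pnet_original")` would return a dictionary keyed by seed pairs, skipping any entry that does not match the assumed pattern.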
Run P-NET with the original setup as described in the publication.
Rscript scripts/load_data_original.R
scripts/run_pnet_docker.sh -e pnet_original
Input data is modified so that presence of mutation and copy number amplification is perfectly correlated with class label 1 (copy number deletion is always 0).
Rscript scripts/load_data_original.R
Rscript scripts/modify_data_deterministic.R
scripts/run_pnet_docker.sh -e pnet_deterministic
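To illustrate what a perfectly label-correlated input looks like, here is a small sketch. It is not the logic of modify_data_deterministic.R. It assumes the simplest possible construction, in which every gene receives the class label as its mutation and amplification value, and every deletion value is 0. The function name and the dictionary-based matrix layout are hypothetical.

```python
def make_deterministic(labels, genes):
    """Build mutation, amplification, and deletion matrices from labels.

    labels: dict mapping sample name -> class label (0 or 1)
    genes: list of gene names
    Returns three dicts of dicts, indexed as matrix[sample][gene].
    """
    # Assumption: all genes carry the signal; mutation equals the class label.
    mutation = {s: {g: int(y == 1) for g in genes} for s, y in labels.items()}
    # Amplification is an independent copy of the same values.
    amplification = {s: dict(row) for s, row in mutation.items()}
    # Copy number deletion is always 0.
    deletion = {s: {g: 0 for g in genes} for s in labels}
    return mutation, amplification, deletion
```

With such data, any single gene suffices to predict the label, so a model that fails to reach near-perfect performance would point to a problem in the pipeline rather than in the data.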
Shuffle training/test labels before each run using uniform class frequencies.
for seed in {-1..49}; do
  Rscript scripts/load_data_original.R
  Rscript scripts/modify_data_shuffled.R FALSE "$seed"
  scripts/run_pnet_docker.sh -e pnet_shuffled_each -l "$seed" -u "$seed"
done
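The exact shuffling procedure lives in modify_data_shuffled.R and is not reproduced here. As a hedged illustration, the sketch below shows two common ways to randomize labels reproducibly from a seed: permuting the existing labels (class counts are preserved) and drawing every label from {0, 1} with equal probability. The phrase "uniform class frequencies" suggests the second variant, but that reading is an assumption, and both function names are hypothetical.

```python
import random


def permute_labels(labels, seed):
    """Shuffle existing labels among samples; class counts stay the same."""
    rng = random.Random(seed)
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled


def draw_uniform_labels(n_samples, seed):
    """Draw each label from {0, 1} with equal probability."""
    rng = random.Random(seed)
    return [rng.randint(0, 1) for _ in range(n_samples)]
```

Using a dedicated `random.Random(seed)` instance keeps each run reproducible without touching the global random state.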
Rscript scripts/load_data_mskimpact.R "Non-Small Cell Lung Cancer"
scripts/run_pnet_docker.sh -e mskimpact_nsclc_original
for seed in {-1..49}; do
  Rscript scripts/load_data_mskimpact.R "Non-Small Cell Lung Cancer"
  Rscript scripts/modify_data_shuffled.R FALSE "$seed"
  scripts/run_pnet_docker.sh -e mskimpact_nsclc_shuffled -l "$seed" -u "$seed"
done
Rscript scripts/load_data_mskimpact.R "Breast Cancer"
scripts/run_pnet_docker.sh -e mskimpact_bc_original
for seed in {-1..49}; do
  Rscript scripts/load_data_mskimpact.R "Breast Cancer"
  Rscript scripts/modify_data_shuffled.R FALSE "$seed"
  scripts/run_pnet_docker.sh -e mskimpact_bc_shuffled -l "$seed" -u "$seed"
done
Rscript scripts/load_data_mskimpact.R "Colorectal Cancer"
scripts/run_pnet_docker.sh -e mskimpact_cc_original
for seed in {-1..49}; do
  Rscript scripts/load_data_mskimpact.R "Colorectal Cancer"
  Rscript scripts/modify_data_shuffled.R FALSE "$seed"
  scripts/run_pnet_docker.sh -e mskimpact_cc_shuffled -l "$seed" -u "$seed"
done
Rscript scripts/load_data_mskimpact.R "Prostate Cancer"
scripts/run_pnet_docker.sh -e mskimpact_pc_original
for seed in {-1..49}; do
  Rscript scripts/load_data_mskimpact.R "Prostate Cancer"
  Rscript scripts/modify_data_shuffled.R FALSE "$seed"
  scripts/run_pnet_docker.sh -e mskimpact_pc_shuffled -l "$seed" -u "$seed"
done
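The four MSK-IMPACT experiments above follow the same pattern and differ only in the cancer type and the abbreviation in the experiment name. The sketch below generates the same command sequence programmatically. The helper `mskimpact_commands` is hypothetical and only builds argument lists, so nothing is executed unless the caller chooses to do so.

```python
# Cancer types and experiment-name abbreviations as used in the commands above.
CANCER_TYPES = {
    "Non-Small Cell Lung Cancer": "nsclc",
    "Breast Cancer": "bc",
    "Colorectal Cancer": "cc",
    "Prostate Cancer": "pc",
}


def mskimpact_commands(cancer_type, short_name, seeds=range(-1, 50)):
    """Return the command lines for the original and shuffled experiments."""
    load = ["Rscript", "scripts/load_data_mskimpact.R", cancer_type]
    run = "scripts/run_pnet_docker.sh"
    # Original experiment: load data, then run P-NET with default seeds.
    commands = [load, [run, "-e", f"mskimpact_{short_name}_original"]]
    # Shuffled experiment: reload, shuffle, and run once per seed.
    for seed in seeds:
        commands.append(load)
        commands.append(
            ["Rscript", "scripts/modify_data_shuffled.R", "FALSE", str(seed)]
        )
        commands.append(
            [run, "-e", f"mskimpact_{short_name}_shuffled",
             "-l", str(seed), "-u", str(seed)]
        )
    return commands
```

To actually run an experiment, each returned list could be passed to `subprocess.run(command, check=True)`; printing the lists first serves as a dry run.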
Run DTox with seeds ranging from 0 (i.e., the original seed) to 50.
scripts/run_dtox_docker.sh
Results from each run are saved in a subfolder indicating the random seed used (e.g., data/dtox/0).
plot_figures.R generates all figures shown in the publication, using files in data (described below):
Rscript scripts/plot_figures.R
styling.R is required by this script.
After each run, the following files are copied from the P-NET output folders:
analysis/extracted/node_importance_graph_adjusted.csv (renamed to node_importance.csv)
: contains node importance scores, with the following columns:
- (first, unnamed): node name
- coef: original node importance scores
- coef_graph: indegree plus outdegree of node
- coef_combined: adjusted node importance score (= coef / coef_graph if coef_graph > mean(coef_graph) + 5 * sd(coef_graph) in the respective layer)
- coef_combined_zscore: scaled coef_combined
- coef_combined2: z(z(coef_graph) - z(coef))
- layer: layer of the node
_logs/p1000/pnet/onsplit_average_reg_10_tanh_large_testing/P-net_ALL_testing.csv (renamed to predictions_test.csv)
: predictions for the test set, with the following columns:
- (first, unnamed): sample name
- pred: predicted class (unfortunately, encoded as a double: 1.0 or 0.0)
- pred_scores: probability of the predicted class
- y: true class (encoded as integer 1 or 0)
_logs/p1000/pnet/onsplit_average_reg_10_tanh_large_testing/P-net_ALL_training.csv (renamed to predictions_train.csv)
: predictions for the training set (same columns as above)
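The degree adjustment described for coef_combined can be sketched as follows. This is an illustration of the stated rule, not P-NET's own code. Two details are assumptions because the column description does not spell them out: nodes below the threshold keep their original coef, and the standard deviation is the population standard deviation. The function name `adjust_importance` is hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def adjust_importance(nodes, n_sd=5):
    """Add 'coef_combined' to each node dict, computed per layer.

    nodes: list of dicts with keys 'coef', 'coef_graph', and 'layer'.
    """
    by_layer = defaultdict(list)
    for node in nodes:
        by_layer[node["layer"]].append(node)

    for layer_nodes in by_layer.values():
        degrees = [n["coef_graph"] for n in layer_nodes]
        # Assumption: population standard deviation within the layer.
        threshold = mean(degrees) + n_sd * pstdev(degrees)
        for n in layer_nodes:
            if n["coef_graph"] > threshold:
                # Highly connected node: divide the score by its degree.
                n["coef_combined"] = n["coef"] / n["coef_graph"]
            else:
                # Assumption: all other scores are left unchanged.
                n["coef_combined"] = n["coef"]
    return nodes
```

The adjustment only affects nodes whose degree is extreme within their layer, which dampens scores that are large mainly because a node has many connections.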
The following files generated by DTox are required for subsequent analyses:
module_relevance.tsv
: contains node importance scores, with the following columns:
- (first, unnamed): compound identifier
- remaining columns: node identifiers (UniProt and Reactome IDs)
test_labels.csv
: predictions for the test set, with two columns:
- truth: true label (0 or 1)
- predicted: predicted label (decimal number between 0 and 1)
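Because the predicted column of test_labels.csv holds decimal values, a cutoff is needed before predictions can be compared with the true labels. The sketch below uses a cutoff of 0.5, which is an assumption and not a value taken from DTox or from the publication. The function name `accuracy_at_threshold` is hypothetical.

```python
import csv
import io


def accuracy_at_threshold(csv_text, threshold=0.5):
    """Fraction of test samples whose thresholded prediction equals the truth."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    hits = sum(
        # Assumption: predictions at or above the threshold count as class 1.
        int(float(row["predicted"]) >= threshold) == int(float(row["truth"]))
        for row in rows
    )
    return hits / len(rows)
```

The function takes the file contents as text, so it could be called with the result of reading a test_labels.csv file from one of the DTox run folders.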
The folder docker/pnet contains everything needed for building a Docker image with P-NET installed:
Dockerfile
: instructions for assembling the image
entrypoint.sh
: script for running P-NET; used as entrypoint in the container
environment_pnet.yml
: conda environment specification
patch_seeds.diff
: patch that allows changing the random seed for P-NET
setup.sh
: executed during image assembly; installs P-NET (with input data) and conda
Build and deploy this image via
docker build --tag ghcr.io/csbg/pnet-container:1.0.0 .
docker push ghcr.io/csbg/pnet-container:1.0.0
The folder docker/dtox contains everything needed for building a Docker image with DTox installed:
Dockerfile
: instructions for assembling the image
entrypoint.sh
: entrypoint in the container; activates the conda environment
environment_dtox.yml
: conda environment specification
patch_seeds.diff
: patch that allows changing the random seed for DTox and saves predicted labels for the test set
run_dtox.py
: executes the DTox workflow as described in the tutorial available in the DTox GitHub repository
setup.sh
: executed during image assembly; installs DTox and conda
Build and deploy this image via
docker build --tag ghcr.io/csbg/dtox-container:1.0.0 .
docker push ghcr.io/csbg/dtox-container:1.0.0
The folder docker/r contains the Dockerfile needed for building a Docker image with R and required packages installed.
Build and deploy this image via
cp ../../renv.lock .
docker build --tag ghcr.io/csbg/r_pnet_robustness:1.0.0 .
docker push ghcr.io/csbg/r_pnet_robustness:1.0.0
genes/
: only the genes present in both of the following two files will be analyzed:
  (a) tcga_prostate_expressed_genes_and_cancer_genes.csv
  (b) HUGO_genes/protein-coding_gene_with_coordinate_minimal.txt (TSV, no column names; meaning of columns: chromosome, start, end, gene name)
pathways/
  pathways_short_names.xlsx
  : short pathway names for figure labels
  Reactome/ReactomePathways.gmt
  : genes associated with Reactome pathways; TSV, no column names, variable number of columns: (1) pathway name (2) reactome id (3) type (unused) (4ff) associated genes
  Reactome/ReactomePathways.txt
  : TSV mapping Reactome ids to names; loaded by P-NET but apparently not used (?)
  Reactome/ReactomePathwaysRelation.txt
  : TSV specifying the Reactome pathway hierarchy as edge list; columns indicate parent and child; only human pathways are used (i.e., the child id has to start with "HSA")
prostate/
  processed/
    P1000_final_analysis_set_cross_important_only.csv
    : mutation data; first column contains sample name, remaining columns represent genes, cells indicate number of mutations; data is preprocessed to a binary matrix, indicating presence/absence of at least one mutation (i.e., 1 if original >= 1)
    P1000_data_CNA_paper.csv
    : CNV data; first column (unnamed) contains sample name, remaining columns represent genes, cells indicate copy number status; data is preprocessed to two binary matrices: one indicates presence of copy number amplification (1 if original > 1.5), the other indicates presence of CN deletion (1 if original < -1.5)
    response_paper.csv
    : input labels, two columns: (1) id – sample name (2) response – sample label (1 = metastatic tumor)
  splits/
  : splits of input data; all files have three columns: (1) [unnamed] – running number starting at zero (2) id – sample name (3) response – sample label (column is NOT used by P-NET!)
    test_set.csv
    : samples in the test set
    training_set_0.csv
    : samples in the training set
    validation_set.csv
    : samples in the validation set
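The preprocessing rules described for these files can be sketched in a few lines. This is an illustration of the stated thresholds and file layouts, not P-NET's own loader, and all function names are hypothetical. One choice is an assumption: Reactome stable identifiers typically have the form R-HSA-…, so the sketch tests whether the child id contains "HSA" rather than requiring a strict prefix.

```python
def binarize_mutations(count):
    """1 if at least one mutation is present, else 0."""
    return 1 if count >= 1 else 0


def binarize_cna(value):
    """Return (amplification, deletion) indicators for one copy number value."""
    return (1 if value > 1.5 else 0, 1 if value < -1.5 else 0)


def parse_gmt_line(line):
    """Split one ReactomePathways.gmt line into name, Reactome id, and genes."""
    fields = line.rstrip("\n").split("\t")
    name, reactome_id = fields[0], fields[1]
    genes = fields[3:]  # the third column (type) is unused
    return name, reactome_id, genes


def human_edges(lines):
    """Yield (parent, child) pairs whose child id refers to a human pathway."""
    for line in lines:
        parent, child = line.rstrip("\n").split("\t")
        # Assumption: a substring test, since ids usually look like "R-HSA-...".
        if "HSA" in child:
            yield parent, child
```

Applying `binarize_cna` to every cell of the CNV table yields the two binary matrices (amplification and deletion) in one pass.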