Skip to content

Commit

Permalink
update readmes
Browse files Browse the repository at this point in the history
  • Loading branch information
aadamk committed Sep 18, 2024
1 parent 7c03da2 commit f2e0c5c
Show file tree
Hide file tree
Showing 4 changed files with 8 additions and 279 deletions.
53 changes: 6 additions & 47 deletions analyses/data_preparation/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -22,71 +22,30 @@ bash run_analysis.sh
### Description of scripts
***

`01-subset-data.R`: Purpose of this script is to subset the input data matrices to a specific short histology in order to create subsetted data matrices. These matrices will be used as input files for all downstream analyses modules.

#### Output

```
data
├── All.gainloss.txt.gz
├── gene-counts-rsem-expected_count-collapsed.rds
├── gene-expression-rsem-tpm-collapsed.rds
├── histologies.tsv
├── methyl-beta-values.rds
├── methyl-m-values.rds
├── snv-consensus-plus-hotspots.maf.tsv.gz
└── splice_events_pan_cancer_functional_filter.rds
```

***
`02-format-splice-data-pegasas.R`: Purpose of this script is to read RMATs file, , and filter to input functional sites per @naqvia and format it per PEGASAS specifications.

#### Output

```
data
└── splice-events-rmats-functional-sites.tsv.gz
```

***
`03-multi-modal-clustering-prepare-data.R` : The purpose of this script is to prepare input files from all available modalities and use them as input for the IntNMF clustering analysis.
`01-multi-modal-clustering-prepare-data.R` : The purpose of this script is to prepare input files from all available modalities and use them as input for the IntNMF clustering analysis.

#### Feature selection

Features were selected from OpenPedCan-analysis v12 datasets using the following filters:

1) **CNV**:

- The `gainloss.txt` file was first filtered to Medulloblatoma samples.
- Adjusted copy number was calculated.
- Neutral sites were removed.
- The dataset was reduced to features corresponding to "Gain", "Amplification" in Oncogenes and "Loss", "Complete Loss" in TSGs (using the comprehensive cancer gene list from `annoFuse`)
- Finally, features with `standard deviation >= 0.9` were chosen for clustering. (https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0176278)

2) **SNV**:

- The Consensus MAF dataset was first filtered to Medulloblatoma samples.
- Features were reduced to non-synonymous variant classifications e.g. `Missense_Mutation, Frame_Shift_Del, In_Frame_Ins, Frame_Shift_Ins, Splice_Site, Nonsense_Mutation, In_Frame_Del, Nonstop_Mutation, Translation_Start_Site`.
- Finally, features with `mutation rate > 0.02` (i.e. filter out features with low mutation rate) were chosen for clustering. (Reference: https://www.bioconductor.org/packages/release/bioc/vignettes/iClusterPlus/inst/doc/iManual.pdf)
Features were selected from OpenPedCan-analysis v15 datasets using the following filters:

3) **RNA**:
1) **RNA**:

- The expected counts dataset was first filtered to Medulloblatoma samples.
- Features were reduced to `Top 1000 most variable protein coding genes` followed by `Rank transformation`.

4) **Methylation**:
2) **Methylation**:

- Methylation beta-values matrix was first filtered to Medulloblatoma samples.
- Features were reduced to `Top 1000 most variable probes`.

5) **Splicing**:
3) **Splicing**:

- Splice matrix was first filtered to Medulloblatoma samples.
- Features were reduced to `Top 1000 most variable splice variants`.

#### Output

The script resulted in `152 Medulloblastoma samples` that have all 5 data modalities available. The filtered/transformed data matrices are written out to individual .tsv files. Additionally a mapping between `Kids_First_Biospecimen_ID` identifiers from each modality and `sample_id` is written out to `samples_map.tsv`.
The script resulted in `152 Medulloblastoma samples` that have all 3 data modalities available. The filtered/transformed data matrices are written out to individual .tsv files. Additionally a mapping between `Kids_First_Biospecimen_ID` identifiers from each modality and `sample_id` is written out to `samples_map.tsv`.

```
results
Expand Down
98 changes: 1 addition & 97 deletions analyses/dge_pathway_analysis/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -40,13 +40,11 @@ results/intNMF/deseq
├── diffexpr_output_per_cluster.tsv
├── hallmark
│ └── cluster_{n}_vs_rest_gsea.tsv
├── kegg_medicus
│ └── cluster_{n}_vs_rest_gsea.tsv
└── reactome
└── cluster_{n}_vs_rest_gsea.tsv
```

For each `cluster-vs-rest` enrichment, a barplot of top 10 upregulated and top 10 downregulated (i.e. maximum of 20 pathways) identified by `clusterProfiler::GSEA` utilizing `KEGG MEDICUS` and `REACTOME` pathways at `FDR < 0.05` is generated under `*_gsea_barplot.pdf`, dotplot is generated under `*_gsea_dotplot.pdf` and network under `*_gsea_cnet.pdf`.
For each `cluster-vs-rest` enrichment, a barplot of top 10 upregulated and top 10 downregulated (i.e. maximum of 20 pathways) identified by `clusterProfiler::GSEA` utilizing `HALLMARK` and `REACTOME` pathways at `FDR < 0.05` is generated under `*_gsea_barplot.pdf`, dotplot is generated under `*_gsea_dotplot.pdf` and network under `*_gsea_cnet.pdf`.

```
# here n is cluster number of interest
Expand All @@ -56,103 +54,9 @@ plots/intNMF/deseq
│ ├── cluster_{n}_vs_rest_gsea_barplot.pdf
│ ├── cluster_{n}_vs_rest_gsea_cnet.pdf
│ └── cluster_{n}_vs_rest_gsea_dotplot.pdf
├── kegg_medicus
│ ├── cluster_{n}_vs_rest_gsea_barplot.pdf
│ ├── cluster_{n}_vs_rest_gsea_cnet.pdf
│ └── cluster_{n}_vs_rest_gsea_dotplot.pdf
└── reactome
├── cluster_{n}_vs_rest_gsea_barplot.pdf
├── cluster_{n}_vs_rest_gsea_cnet.pdf
└── cluster_{n}_vs_rest_gsea_dotplot.pdf
```

***
`01-dge_analysis_noiseq.R`: The function of this script is to perform differential gene expression using [NOISeq](https://www.bioconductor.org/packages/release/bioc/html/NOISeq.html) R package using the `cluster vs rest` approach.

#### Inputs

```
../../data
├── c2.cp.kegg_medicus.v2023.2.Hs.symbols.gmt # KEGG MEDICUS gmt file
└── gencode.v39.primary_assembly.annotation.gtf.gz # gencode v39
# cohort specific files
../data_preparation/data
└── gene-counts-rsem-expected_count-collapsed.rds
# intNMF inputs but any other clustering method can be used
../intNMF
└── results/intnmf_clusters.tsv # intNMF clusters
```

#### Outputs

For all the `cluster-vs-rest` comparisons, the NOISeq output with FDR adjusted p-value < 0.05 is saved under `diffexpr_output_per_cluster.tsv` and the pathway enrichment output using `clusterProfiler::GSEA` with FDR adjusted p-value < 0.05 is saved under `_gsea.tsv`.


```
# NOISeq output
results/intNMF/noiseq
├── diffexpr_output_per_cluster.tsv
├── hallmark
│ └── cluster_{n}_vs_rest_gsea.tsv
├── kegg_medicus
│ └── cluster_{n}_vs_rest_gsea.tsv
└── reactome
└── cluster_{n}_vs_rest_gsea.tsv
```

For each `cluster-vs-rest` enrichment, a barplot of top 10 upregulated and top 10 downregulated (i.e. maximum of 20 pathways) identified by `clusterProfiler::GSEA` utilizing `KEGG MEDICUS` and `REACTOME` pathways at `FDR < 0.05` is generated under `*_gsea_barplot.pdf`, dotplot is generated under `*_gsea_dotplot.pdf` and network under `*_gsea_cnet.pdf`.


When using NOISeq for differential expression analysis, `noiseq_pca.pdf` is generated which has the PCA plot of input matrix before and after NOISeq batch correction.

```
# here n is cluster number of interest
# NOISeq output
plots/intNMF/noiseq
├── hallmark
│ ├── cluster_{n}_vs_rest_gsea_barplot.pdf
│ ├── cluster_{n}_vs_rest_gsea_cnet.pdf
│ └── cluster_{n}_vs_rest_gsea_dotplot.pdf
├── kegg_medicus
│ ├── cluster_{n}_vs_rest_gsea_barplot.pdf
│ ├── cluster_{n}_vs_rest_gsea_cnet.pdf
│ └── cluster_{n}_vs_rest_gsea_dotplot.pdf
├── noiseq_pca.pdf
└── reactome
├── cluster_{n}_vs_rest_gsea_barplot.pdf
├── cluster_{n}_vs_rest_gsea_cnet.pdf
└── cluster_{n}_vs_rest_gsea_dotplot.pdf
```
***

`02-dge_diablo_integration.R`: Purpose of this script is to use DGEs from `DESeq2` or `NOISeq` and compute an intersection with the output of `DIABLO` descriptive analysis.

#### Inputs:

```
# DIABLO descriptive output
../diablo/results/intnmf/descriptive
└── diablo_MB.rds
# DESeq2 output
results/intNMF/deseq
└── diffexpr_output_per_cluster.tsv
# or NOISeq output
results/intNMF/noiseq
└── diffexpr_output_per_cluster.tsv
```

#### Outputs:

```
# DESeq2
results/intNMF/deseq/diablo_integration
└── comp{component_number}_diablo_dge_intersection.tsv
# NOISeq
results/intNMF/noiseq/diablo_integration
└── comp{component_number}_diablo_dge_intersection.tsv
```
8 changes: 0 additions & 8 deletions analyses/intNMF/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,21 +14,13 @@ bash run_analysis.sh
***
`01-multi-modal-clustering-run.R`: Purpose of this script is to read input files generated in the `data_preparation` module and run IntNMF to identify the most optimal cluster fitting the dataset of interest.

#### Selection of optimal cluster

From this comment: https://github.com/d3b-center/bixu-tracker/issues/1704#issuecomment-1480304368
For each attempted k, the cluster values were mapped to the samples for each mode of data (e.g. CNV, SNV, Methylation, RNA and Splicing), the fpc stats for each mode were computed, sum of the `average.between` and `average.within` across the modes of data at each k were taken, then the difference of those two summed values at each k was taken to select the optimal cluster number (e.g. the k with the largest difference between the two).

Using this method, the samples were classified into `14 clusters`.

#### Input

```
../data_preparation/results
├── cnv_data.tsv # cnv data used as input
├── methyl_data.tsv # methylation data used as input
├── norm_counts.tsv # expression data used as input
├── snv_data.tsv # snv data used as input
├── splice_data.tsv # splice data used as input
└── samples_map.tsv # biospecimens + cohort identifiers for samples used for each modality
```
Expand Down
128 changes: 1 addition & 127 deletions analyses/methylation_analysis/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -84,135 +84,9 @@ plots
└── genebody_promoter_gsameth_pathways.pdf
```

### Functional prioritization of differentially methylated CpG sites

`02-methreg_analysis.R`: The R package `MethReg` uses triplet information from `DNA methylation + RNA expression data + transcription factor (TF) binding sites` for functional prioritization of significantly differentially methylated sites.

The input for this script is *matched* `methylation m-values` and `RNA expression matrix (raw counts)` matrix.

- There are two types of workflows: supervised and unsupervised.

1. The `unsupervised workflow` does not account for any design information (i.e. cannot differentiate groups in the matrix) so the solution would be running on each cluster individually by subsetting to samples of interest and running the workflow.
2. For the `supervised workflow`, we can use the cluster-specific DMS or DMRs as input.

**Note**: We chose to use the DMSites from the `gene body + promoter` output from step 1) i.e. the output of `01-limma_analysis.R` to run the supervised workflow.

- Creating the triplets (DNA methylation + RNA expression + TF binding sites) can be done in three ways as described [here](https://www.bioconductor.org/packages/release/bioc/vignettes/MethReg/inst/doc/MethReg.html#methreg-workflow).

1. Mapping the region to the closest gene (`target.method = "genes.promoter.overlap"`).
2. Mapping the region to a specific number of genes upstream down/upstream of the region (`target.method = "nearby.genes"`)
3. Mapping the region to all the genes within a window (default size = 500 kbp around the region, i.e. +- 250 kbp from start or end of the region) (`target.method = "window"`)

Linear model is applied where methylation levels are represented as binary values for high and low values.

**Note**: We used `Mapping the region to the closest gene` because it takes the least amount of runtime. The other two methods show a runtime of 25-30h.

#### Inputs

```
# methylation m-values subsetted to cohort of interest
../data_preparation/data
└── methyl-m-values.rds
# gene expression
../data_preparation/data
└── gene-counts-rsem-expected_count-collapsed.rds
# intNMF derived clusters
../intNMF/results/intnmf_clusters.tsv
# differentially methylated CpG sites
results/limma_output
└── genebody_promoter_diffexpr_probes_per_cluster.tsv
```

#### Outputs

- Output of the workflow:

1. Table of significant triplets with the role of TF in the target gene expression (repressor or activator) and the role of DNA methylation on TF (enhancing or attenuating).
2. Table of fitting the linear model with p-value and estimate of 1) direct effect of DNAm 2) direct effect of TF 3) synergistic effect of DNAm + TF on gene expression.
3. Associated scatter plots of top 5 triplets per cluster.

Here is the output of supervised workflow with DMSites obtained after running `01-limma_analysis.R` + triplet creation using mapping TF binding sites to the closest gene:

1. `triplet_nearest_gene_interactions.tsv`: Table of fitting the linear model with p-value and estimate of 1) direct effect of DNAm 2) direct effect of TF 3) synergistic effect of DNAm + TF on gene expression.

2. `triplet_nearest_gene_interactions_stratified_model.tsv`: Table of significant triplets with the role of TF in the target gene expression (repressor or activator) and the role of DNA methylation on TF (enhancing or attenuating).

```
results
└── methreg_output
├── methylation_matrix_methreg.rds
├── expression_matrix_methreg.rds
├── triplet_nearest_gene_interactions.tsv
└── triplet_nearest_gene_interactions_stratified_model.tsv
```

Plots of top 5 triplets per cluster:

```
# n is the cluster number
plots
└── methreg_output
└── triplet_nearest_gene_interactions_top5_cluster_{n}.pdf
```

### Pathway enrichment of functional CpG sites and differentially expressed target genes

`03-methreg_fgsea_analysis.R`: The purpose of this script is to perform

1. Pathway enrichment using `missMethyl::gsameth` on the function CpG sites identified by step 2 i.e. `02-methreg_analysis.R`. Only those interactions are used where the the there is a 1) direct effect of DNAm and 2) synergistic effect of DNAm + TF on gene expression.
2. Pathway enrichment using `fgsea` on the target genes of those CpG sites. Only those target genes are used that are found to be differentially expressed using DESeq2 per the `../dge_pathway_analysis` module.

#### Inputs

```
# triplets identified in 02-methreg_analysis.R
results/methreg_output
└── triplet_nearest_gene_interactions.tsv
# differentially expressed genes using DESeq2
../dge_pathway_analysis/results/intNMF/deseq
└── diffexpr_output_per_cluster.tsv
# KEGG Medicus pathway gmt file
# for differentially methylated CpG sites pathway enrichment
input
└── c2.cp.kegg_medicus.v2023.2.Hs.entrez.gmt
# KEGG Medicus pathway gmt file
# for differentially expressed target genes pathway enrichment
input
└── c2.cp.kegg_medicus.v2023.2.Hs.symbols.gmt
# Hallmark pathway gmt file
# for differentially methylated CpG sites pathway enrichment
input
└── h.all.v2023.2.Hs.entrez.gmt
# Hallmark pathway gmt file
# for differentially expressed target genes pathway enrichment
input
└── h.all.v2023.2.Hs.symbols.gmt
```

#### Outputs

```
results
└── methreg_output
├── hallmark
│ ├── functional_cpg_pathway_enrichment.tsv # enrichment of differentially methylated + functional CpG sites
│ └── functional_target_gene_pathway_enrichment.tsv # enrichment of differentially expressed target genes
└── kegg_medicus
├── functional_cpg_pathway_enrichment.tsv
└── functional_target_gene_pathway_enrichment.tsv
```

### Differential methylated region analysis (DMRcate) + Pathway Enrichment (gsaregion)

`04-dmr_gsaregion_analysis.R`: The function of this script is to pull the full medulloblastoma methylation m-values dataset and filter the dataset into probes representing promoter region, gene-body (introns and exons) and promoter + gene-body combined. Next, differential region-level methylation analyses is performed with `DMRcate::dmrcate` using a cluster-of-interest vs 'rest' approach, and specifying 'EPIC' array. Finally, `missMethyl::gsaregion` analysis (setting sig.genes = TRUE) is performed on the the resulting differentially methylated regions to determine cluster-specific pathway differences.
`03-dmr_gsaregion_analysis.R`: The function of this script is to pull the full medulloblastoma methylation m-values dataset and filter the dataset into probes representing promoter region, gene-body (introns and exons) and promoter + gene-body combined. Next, differential region-level methylation analyses is performed with `DMRcate::dmrcate` using a cluster-of-interest vs 'rest' approach, and specifying 'EPIC' array. Finally, `missMethyl::gsaregion` analysis (setting sig.genes = TRUE) is performed on the the resulting differentially methylated regions to determine cluster-specific pathway differences.

Tests of interest (array.type = 'EPIC'):
a. gene body only
Expand Down

0 comments on commit f2e0c5c

Please sign in to comment.