02-methods.Rmd

---
output:
  html_document: default
  pdf_document: default
---

```{r}
knitr::opts_chunk$set(eval = FALSE)
```

# Methods

Our work is divided into `code`, `data`, and `app` sub-folders. The steps taken include using code to clean data, visualize data, and make a shiny application to communicate our findings.

## Data Processing
Initially, as part of the data cleaning step, we examined data to detect and correct corrupt or inaccurate records from the transcriptome and attribute tables. We realized that the mean of all genes is the same but standard deviation across the experiments vary. The tasks we took to clean expression data include: 
- removing genes with zero expression across the experiments
- removing genes that had additional `_` as part of their name

For biological replicates, we either removed one of the duplicates for visualization or averaged the gene expression levels for the duplicate samples.

We also identified incomplete, inaccurate or irrelevant data fields in the meta data tables. Specifically, the labels and annocation files were cleaned and modified for the rest of the analyses.

## Data Analysis

After data cleaning, our gene expression dataset had 87 unique experimental conditions (samples) and 5893 genes. We had little understanding of the domains the data belongs to since the data was scraped from various sources. In our analyses, we tried to make as little judgement as possible and leave the hypothesis-making to users instead by making the data elements interactive.
We identified hidden patterns in the data using a variety of dimensionality reduction algorithms. In statistics, machine learning, and information theory, dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration by obtaining a set of principal variables. 

Our approaches included: 

- HEATMAP: heatmaps were used as a means for graphical representation of expression data 

- LIMMA: Limma algorithm was used for differential expression analysis involving comparisons between two groups of experiments at a time 

- TSNE: t-SNE (tsne) concept and algorithm which is well-suited for visualizing high-dimensional data. The name stands for t-distributed Stochastic Neighbor Embedding. The idea is to embed high-dimensional points in low dimensions in a way that respects similarities between points.

- UMAP: UMAP (Uniform Manifold Approximation and Projection) is a practical scalable algorithm to reduce dimensions based on a theoretical framework from Riemannian geometry and algebraic topology


## Data Visualization 
We deployed an shiny application in order to show the results of our analyses. The tabs within our dashboard include:
- HEATMAP
- LIMMA
- TSNE
- UMAP