Final re-work of Part 1
Mark Dunning authored and Mark Dunning committed Jan 3, 2025
1 parent 9d5a567 commit ee9e6c6
Showing 3 changed files with 172 additions and 90 deletions.
73 changes: 47 additions & 26 deletions Part1.Rmd
@@ -312,7 +312,7 @@ summarise(Average = mean(age_at_diagnosis),

## Overview of plotting

Our recommended way of creating plots in RStudio is to use the `ggplot2` package
Our recommended way of creating plots in RStudio is to use the `ggplot2` package - especially as it interacts well with `dplyr` and other `tidyverse` packages.

```{r}
library(ggplot2)
@@ -331,8 +331,20 @@ The general principle of creating a plot is the same regardless of what kind of
- define the type of plot we want
- apply any additional format changes
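
The steps listed above can be sketched on a small made-up data frame. This is only an illustration: `toy_data` and its values are hypothetical and stand in for `analysis_data`; only the column names are borrowed from the dataset used in this section.

```{r}
library(ggplot2)

## toy_data is a hypothetical stand-in for analysis_data
toy_data <- data.frame(
  gender = c("MALE", "FEMALE", "FEMALE", "MALE", "FEMALE"),
  age_at_diagnosis = c(61, 72, 55, 68, 49)
)

## specify the data and which columns are mapped to the axes
p <- ggplot(toy_data, aes(x = gender, y = age_at_diagnosis))
## define the type of plot we want
p <- p + geom_boxplot()
## apply any additional format changes
p <- p + labs(x = "Gender", y = "Age at diagnosis") + theme_bw()
p
```

Building the plot up one `+` at a time makes it easy to swap the plot type or the formatting without touching the data and mapping.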

A bar plot would be a natural choice for showing the counts of male / female samples. The `geom_bar` plot will automatically count how many occurrences there are for each value.

```{r}
ggplot(analysis_data, aes(x = gender)) + geom_bar()
```

Numerical data can be visualised using a density plot or histogram. The density is automatically calculated and displayed on the y-axis.

```{r}
ggplot(analysis_data, aes(x = age_at_diagnosis)) + geom_density()
```

In order to compare the age distributions of different tumour types we can imagine this being displayed as a boxplot with

In order to compare the age distributions of different tumour types we can also imagine this being displayed as a series of boxplots with

- the age variable on the y-axis
- the type of tumour on the x-axis
@@ -343,55 +355,66 @@ this can be translated into `ggplot2` language as follows -
ggplot(analysis_data, aes(x = tumor_tissue_site, y = age_at_diagnosis)) + geom_boxplot()
```

A disadvantage of the boxplot is that it only gives a very crude summary of the data.
A disadvantage of the boxplot is that it only gives a very crude summary of the data. It can be misleading when applied to data with few observations, and it is often preferable to add the individual data points.

```{r}
ggplot(analysis_data, aes(x = tumor_tissue_site, y = age_at_diagnosis)) + geom_boxplot() + geom_jitter(width=0.1)
```




A bar plot would be a natural choice for showing the counts of male / female samples. The `geom_bar` plot will automatically count how many occurrences there are for each value.
Adding some colour to the plot can be achieved by adding a `fill` aesthetic and specifying which column to map the colours to. A colour palette is automatically chosen, but can be changed afterwards if we wish.

```{r}
ggplot(analysis_data, aes(x = gender)) + geom_bar()
ggplot(analysis_data, aes(x = tumor_tissue_site, y = age_at_diagnosis, fill = tumor_tissue_site)) + geom_boxplot() + geom_jitter(width=0.1)
```
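
As a minimal sketch of changing the automatically-chosen palette, `scale_fill_manual` from `ggplot2` accepts a named vector of colours. The data frame, the `Breast` category and the colour choices below are made up for illustration and are not taken from the course data.

```{r}
library(ggplot2)

## hypothetical data with two tumour types
toy_sites <- data.frame(
  tumor_tissue_site = rep(c("Bladder", "Breast"), each = 4),
  age_at_diagnosis = c(61, 72, 66, 70, 49, 55, 58, 63)
)

## scale_fill_manual replaces the default palette with the colours we name
p_fill <- ggplot(toy_sites, aes(x = tumor_tissue_site, y = age_at_diagnosis, fill = tumor_tissue_site)) +
  geom_boxplot() +
  scale_fill_manual(values = c("Bladder" = "steelblue", "Breast" = "orange"))
p_fill
```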
Adding the `fill` aesthetic to the density plot shows a separate curve for each tumour type.

```{r}
## alpha of 0.5 used to make the curves transparent
ggplot(analysis_data, aes(x = age_at_diagnosis, fill = tumor_tissue_site)) + geom_density(alpha=0.5)
```
Another useful technique for splitting the plots based on a variable is to use the `facet_wrap` function, which gives a grid of plots. For instance, we can show male/female counts for each tumour type separately.

```{r}
bladder_data <- filter(analysis_data, tumor_tissue_site == "Bladder")
ggplot(bladder_data, aes(x = gender)) + geom_bar()
ggplot(analysis_data, aes(x = gender,fill=gender)) + geom_bar() + facet_wrap(~tumor_tissue_site)
```

By combining all the techniques we have seen, we can compare the diagnosis age between males and females, separately for each tumour type.

```{r}
ggplot(analysis_data, aes(x = gender)) + geom_bar() + facet_wrap(~tumor_tissue_site)
ggplot(analysis_data, aes(x =gender, y = age_at_diagnosis, fill = gender)) + geom_boxplot() + geom_jitter(width=0.1) + facet_wrap(~tumor_tissue_site)
```

## Challenges of "messy" data
# Challenges of "messy" data

Real-life data are often
Real-life data are often less straightforward to deal with than the "cleaned" dataset presented here. Despite the many high-throughput technologies that are used for scientific investigation, one or more spreadsheets are inevitably needed to describe the experimental setup, and these are typically entered manually.

So-called "Data Wrangling" is a crucial and time-consuming part of the analysis process, taking 80% of analysis time by some estimates. Hadley Wickham, Chief Scientist at Posit and lead author of `ggplot2`, likens tidy and messy data to Leo Tolstoy's quote about families:

> Happy families are all alike; every unhappy family is unhappy in its own way


> Like families, tidy datasets are all alike but every messy dataset is messy in its own way.
A comprehensive guide to the issues surrounding data entry via spreadsheets, and how to avoid them, is given by Data Carpentry.

- [Data Carpentry Spreadsheets lesson](https://data-lessons.github.io/gapminder-spreadsheet/)

However, for public data that we have no control over, we often have no choice but to clean the data ourselves. We have created an alternative dataset with a few intentional issues to illustrate the cleaning process.

```{r}
messy <- read_tsv("tcga_clinical_MESSY.tsv")
messy
```

### Whitespace
## Whitespace

"Whitespace" is a blank character or space added to the beginning or end of text. It is a common problem because it creates extra categories in your data, e.g. `MALE` and `MALE `. The messy dataset that you have just imported includes some whitespace in the `tumor_tissue_site` column. However, the `read_tsv` function trims leading and trailing whitespace automatically, as its `trim_ws` argument is set to `TRUE` by default (see the help page `?read_tsv`). To see the problem we can re-import the data with trimming switched off.

```{r}
messy_ws <- read_tsv("tcga_clinical_MESSY.tsv",
trim_ws = FALSE)
messy_ws
count(messy_ws,tumor_tissue_site)
```

The resulting data frame now contains two apparently identical categories for `Bladder`. However, with the use of the `nchar` function, which counts the number of characters, we can see that extra spaces must be included.
@@ -409,16 +432,13 @@ For the example of removing whitespace we can use the `str_trim` function combin

```{r}
library(stringr)
mutate(messy_ws, tumor_tissue_site = str_trim(tumor_tissue_site)) %>%
count(tumor_tissue_site)
```
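
The effect of `nchar` and `str_trim` can also be checked on a short vector. The values below are made up to mimic the `Bladder` categories and are not read from the messy file.

```{r}
library(stringr)

## hypothetical values: one clean entry and two with stray spaces
sites <- c("Bladder", "Bladder ", " Bladder")

## the entries look alike when printed, but their lengths differ
nchar(sites)
unique(sites)

## str_trim removes whitespace from both ends, leaving one category
trimmed <- str_trim(sites)
unique(trimmed)
```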

## Inconsistent coding of variables

Unfortunately`tumor_tissue_site` column is not the only one with issue that need fixing with these data. If, as before, we try and plot the number of males/females in the dataset we get a surprise.

Unfortunately the `tumor_tissue_site` column is not the only one with issues that need fixing in these data. If, as before, we try to plot the number of males/females in the dataset we get a surprise.

```{r}
ggplot(messy, aes(x = gender)) + geom_bar()
@@ -489,6 +509,8 @@ Because the `NULL` value is present in the `age_at_diagnosis` column, R will tre
ggplot(messy, aes(x = age_at_diagnosis)) + geom_histogram()
```
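
A small sketch of why a single `NULL` entry causes trouble is given here, using a made-up vector rather than the real column. Once the placeholder text is replaced by a proper missing value the column can be converted to numbers; the `na` argument of `read_tsv` achieves the same thing at import time.

```{r}
## hypothetical values mimicking a column that contains the text "NULL"
age_raw <- c("61", "72", "NULL", "55")

## one non-numeric entry is enough for the whole vector to be text
class(age_raw)

## replace the placeholder with a missing value, then convert to numeric
age_raw[age_raw == "NULL"] <- NA
age_clean <- as.numeric(age_raw)

## missing values now have to be excluded explicitly
mean(age_clean, na.rm = TRUE)
```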

Likewise, we can't calculate numeric summaries, although R will attempt to and will create a data frame of `NA` values (with a warning) rather than giving an error.

```{r}
group_by(messy, tumor_tissue_site) %>%
summarise(Mean_Diagnosis_Age = mean(age_at_diagnosis,na.rm=TRUE))
@@ -530,18 +552,17 @@ messy %>%
mutate(height_at_diagnosis=str_sub(height_at_diagnosis, end=-3)) %>%
mutate(height_at_diagnosis = as.numeric(height_at_diagnosis)) %>%
arrange(height_at_diagnosis)
```
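
To see what `str_sub` with `end = -3` does in isolation, here is a minimal sketch on a made-up vector. The two-character `cm` suffix is an assumption for illustration; the real column may use a different suffix.

```{r}
library(stringr)

## hypothetical height values with a two-character suffix
height_raw <- c("172cm", "165cm", "180cm")

## end = -3 keeps everything up to the third character from the end,
## i.e. it drops the last two characters
height_num <- as.numeric(str_sub(height_raw, end = -3))
height_num
```

Negative positions count back from the end of the string, so the same call works however many digits the number has.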

### Final code to clean the data
## Final code to clean the data

For reference, here is the final code chunk that can be used to clean the data.

```{r}
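## treat the text "NULL" as a missing value when importing
cleaned <- read_tsv("tcga_clinical_MESSY.tsv", na = c("NULL","NA")) %>% 
  ## recode the lower-case gender labels so the coding is consistent
  mutate(gender = forcats::fct_recode(gender,"MALE"="male"),
         gender = forcats::fct_recode(gender,"FEMALE"="female")) %>% 
  ## drop the last two characters of the height values, then convert to numeric
  mutate(height_at_diagnosis=str_sub(height_at_diagnosis, end=-3)) %>% 
  mutate(height_at_diagnosis = as.numeric(height_at_diagnosis))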
```

188 changes: 124 additions & 64 deletions Part1.nb.html

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions README.md
@@ -31,6 +31,7 @@ install.packages(c("readr",
+ [HTML](Part1.nb.html)
+ [Markdown](Part1.Rmd)
+ [Example Data (tcga_clinical_CLEANED.tsv)](tcga_clinical_CLEANED.tsv)
+ [Example Data 2 (tcga_clinical_MESSY.tsv)](tcga_clinical_MESSY.tsv)

## Part 2 (Tidy RNA-seq)
