---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```
# UnsupLearnR
<!-- badges: start -->
<!-- badges: end -->
The goal of UnsupLearnR is to assist with unsupervised learning in R, particularly with visualisation.
## Installation
You can install the development version of UnsupLearnR like so:
``` r
# install pacman if it is not already available
if (!require("pacman")) install.packages("pacman")
# install dependencies from CRAN (p_load installs any missing packages, then loads them)
pacman::p_load(devtools, ggplot2, ggpubr)
# install graphicsPLr from GitHub
if (!require("graphicsPLr")) devtools::install_github('Phuong-Le/graphicsPLr')
# install UnsupLearnR itself from GitHub
devtools::install_github('Phuong-Le/UnsupLearnR')
```
## Example
I will use the iris data as a demonstration.

First, PCA:
```{r pca}
library(UnsupLearnR)
library(dplyr)
library(stats)
library(factoextra)
data(iris)
# let's briefly have a look at the iris data
head(iris)
# let's do a PCA on the four numeric columns (ie leave out Species)
iris_numeric = select(iris, -c(Species))
## performing PCA
## the `scale. = TRUE` option centres and scales each variable to mean 0 and variance 1;
## this is generally recommended as the variables are not on the same units/scales
pca = prcomp(iris_numeric, scale. = TRUE)
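# as a quick sanity check, scale() applies the same centring/scaling:
# after scaling, every column has mean ~0 and standard deviation 1
round(colMeans(scale(iris_numeric)), 10)
apply(scale(iris_numeric), 2, sd)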
pca_ed = as.data.frame(pca$x)
# let's colour the PCA plot by iris species
# first create a param dataframe; the column to be coloured has to be named "param"
params = data.frame(param=iris$Species)
plot_pca(pca_ed, params = params)
# it's worth looking at the loadings plot; let's use fviz_pca_var from the factoextra package (I think it does a really good job, so I don't need to write a similar function in UnsupLearnR)
library(ggpubr)
fviz_pca_var(pca) +
  theme_pubr()
# you can also choose the dimensions (principal components) on the x and y axes
plot_pca(pca_ed, params = params, x = 2, y = 3)
# PCA preserves the number of dimensions: since we started with 4 variables (columns), we now have 4 principal components
# how do you know how many dimensions to look at? let's check the scree plot (we should have done this before even looking at the PCA plots themselves)
# the scree plot shows that the first 2 dimensions already capture almost 100% of the variation (ie information) in the data,
# so looking at the last 2 dimensions won't add much more value
scree_plot(pca_ed)
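# the proportions behind the scree plot can also be computed directly from the
# prcomp object: each component's variance is sdev^2, divided by the total
prop_var = pca$sdev^2 / sum(pca$sdev^2)
round(cumsum(prop_var), 3)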
```
UnsupLearnR also helps with clustering. Note that this is designed for the clustering functions in the `stats` package, such as `kmeans`.
```{r clustering}
# let's do kmeans clustering
# let's set K = 3 (because it's quite obvious on the PCA plot),
# use 30 initialisations because the first step of kmeans is a randomisation step,
# and cap the number of iterations at 20
k = 3
set.seed(2510)
cluster = make_cluster(iris_numeric, clusterFUN = kmeans, centers = k, nstart = 30, iter.max = 20)
set.seed(NULL)
# now that we have the cluster object, let's see how species are assigned to clusters
plot_labelled_cluster_bars(cluster = cluster, param = params)
# and map the clustering result to the PCA plot
plot_pca(pca_ed, params, cluster = cluster)
```
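Under the hood this drives `stats::kmeans` (note `clusterFUN = kmeans` above). For reference, here is a minimal base-R sketch of the equivalent direct call; the cross-tabulation against species is my own addition rather than an UnsupLearnR function:

``` r
# minimal sketch: the same k-means fit using stats::kmeans directly
set.seed(2510)
km = kmeans(iris_numeric, centers = 3, nstart = 30, iter.max = 20)
# cross-tabulate the fitted clusters against the true species labels
table(km$cluster, iris$Species)
```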
What if it's unclear how many clusters K to choose? Let's examine a range:
```{r choose-k}
library(ggplot2)
set.seed(2510)
n_clusters = 15
clusters = list()
for (i in 1:n_clusters) {
  clusters[[i]] = make_cluster(iris_numeric, kmeans, centers = i, nstart = 30, iter.max = 20)
}
set.seed(NULL)
# let's look at the total within-cluster variation plot
# total within-cluster variation measures how far datapoints are from their cluster centers
# as K increases, it always decreases
# a good K is one where you see a kink (an "elbow")
# here the decrease from 3 to 4 is tiny compared to the decrease from 2 to 3, suggesting that K = 3 is a good choice
tot_within_var_plot(clusters = clusters) +
  scale_x_continuous(n.breaks = n_clusters)
```
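For reference, the quantity plotted above corresponds to the `tot.withinss` component returned by `stats::kmeans`. A minimal base-R sketch of the same elbow check, assuming plain `kmeans` objects rather than `make_cluster` output:

``` r
# minimal sketch: total within-cluster sum of squares for K = 1..n_clusters
set.seed(2510)
wss = sapply(1:n_clusters, function(k) {
  kmeans(iris_numeric, centers = k, nstart = 30, iter.max = 20)$tot.withinss
})
plot(1:n_clusters, wss, type = "b", xlab = "K", ylab = "total within-cluster variation")
```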