-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathREADME.Rmd
222 lines (173 loc) · 7.98 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "75%",
fig.retina = 2
)
library(tidyverse)
library(patchwork)
library(tibble)
library(magrittr)
library(ggplot2)
library(gghilbertstrings)
library(glue)
reps <- 10 #takes about 50 seconds on iMac i9 3.5GHz
size_exponent <- 11
```
# gghilbertstrings
<!-- badges: start -->
[![Travis build status](https://travis-ci.com/Sumidu/gghilbertstrings.svg?branch=master)](https://travis-ci.com/Sumidu/gghilbertstrings)
[![AppVeyor build status](https://ci.appveyor.com/api/projects/status/github/Sumidu/gghilbertstrings?branch=master&svg=true)](https://ci.appveyor.com/project/Sumidu/gghilbertstrings)
[![Codecov test coverage](https://codecov.io/gh/Sumidu/gghilbertstrings/branch/master/graph/badge.svg)](https://codecov.io/gh/Sumidu/gghilbertstrings?branch=master)
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html)
[![R-CMD-check](https://github.com/Sumidu/gghilbertstrings/workflows/R-CMD-check/badge.svg)](https://github.com/Sumidu/gghilbertstrings/actions)
[![CRAN status](https://www.r-pkg.org/badges/version/gghilbertstrings)](https://CRAN.R-project.org/package=gghilbertstrings)
<!-- badges: end -->
A [Hilbert curve](https://en.wikipedia.org/wiki/Hilbert_curve) (also known as a Hilbert space-filling curve) is a continuous fractal space-filling curve first described by the German mathematician David Hilbert in 1891, as a variant of the space-filling Peano curves discovered by Giuseppe Peano in 1890 (from Wikipedia).
This package provides an easy access to using Hilbert curves in `ggplot2`.
## Installation
You can install the released version of gghilbertstrings from [CRAN](https://CRAN.R-project.org) with:
``` r
install.packages("gghilbertstrings")
```
You can install the development version from [GitHub](https://github.com/) with:
``` r
# install.packages("remotes") # run only if not installed
remotes::install_github("Sumidu/gghilbertstrings")
```
## Usage
The `gghilbertstrings` package comes with functions for fast plotting of Hilbert curves in ggplot. At it's core is a fast RCpp implementation that maps a 1D vector to a 2D position.
The `gghilbertplot` function creates a Hilbert curve and plots individual data points to the corners of this plot. It automatically rescales the used `ID`-variable to the full range of the Hilbert curve. The method also automatically picks a suitable level of detail able to represent all values of `ID`.
The following figure shows different hilbert curves for different maximum `ID`s.
```{r hilbert, echo=FALSE}
p1 <- tibble(id = c(1,4)) %>% gghilbertplot(id, add_curve = T) + ggtitle("n = 4") + theme_void()
p2 <- tibble(id = c(1,16)) %>% gghilbertplot(id, add_curve = T) + ggtitle("n = 16") + theme_void()
p3 <- tibble(id = c(1,64)) %>% gghilbertplot(id, add_curve = T) + ggtitle("n = 64") + theme_void()
p4 <- tibble(id = c(1,256)) %>% gghilbertplot(id, add_curve = T) + ggtitle("n = 256") + theme_void()
p1 + p2 + p3 + p4 + plot_annotation(title = "Different depths of Hilbert curves")
```
### Plotting random data
The most simple way to plot data is to generate an `id` column that ranges from 1 to n, where n is the largest value to use in the Hilbert curve. Beware: The `id`s are rounded to integers.
```{r example}
library(gghilbertstrings)
# val is the ID column used here
df <- tibble(val = 1:256,
size = runif(256, 1, 5), # create random sizes
color = rep(c(1,2,3,4),64)) # create random colours
gghilbertplot(df, val,
color = factor(color), # render color as a factor
size = size,
add_curve = T) # also render the curves
```
### Performance
We run the creation of a coordinate system `r reps` times. This means creating 1 entry for every possible corner in the Hilbert Curve.
```r
library(microbenchmark)
library(HilbertCurve)
library(tidyverse)
library(gghilbertstrings)
mb <- list()
for (i in 1:10) {
df <- tibble(val = 1:4^i,
size = runif(4^i, 1, 5),
# create random sizes
color = rep(c(1, 2, 3, 4), 4^(i - 1)))
values <- df$val
mb[[i]] <- microbenchmark(times = reps,
HilbertCurve = {
hc <- HilbertCurve(1, 4^i, level = i, newpage = FALSE)
},
gghilbertstrings = {
ggh <- hilbertd2xy(n = 4^i, values)
})
}
```
```{r setupmb, eval=TRUE, include=FALSE}
# Performance benchmark
library(microbenchmark)
remotes::install_github("jokergoo/HilbertCurve")
library(HilbertCurve)
library(tidyverse)
library(gghilbertstrings)
mb <- list()
for (i in 1:10) {
df <- tibble(val = 1:4^i,
size = runif(4^i, 1, 5),
# create random sizes
color = rep(c(1, 2, 3, 4), 4^(i - 1)))
values <- df$val
mb[[i]] <- microbenchmark(times = reps,
HilbertCurve = {
hc <- HilbertCurve(1, 4^i, level = i, newpage = FALSE)
},
gghilbertstrings = {
ggh <- hilbertd2xy(n = 4^i, values)
})
}
res <- data.frame()
for (i in 1:length(mb)) {
tmp <- mb[[i]] %>% as_tibble() %>% mutate(depth = i)
res <- res %>% bind_rows(tmp)
}
```
```{r output, echo=FALSE, message=FALSE, warning=FALSE}
library(scales)
res %>% mutate(time = time/1000) %>%
ggplot() +
aes(x = depth, y = time, color = expr) +
geom_jitter(width = 0.1, height = 0, alpha = 0.2) +
geom_smooth() +
scale_y_log10(labels = comma) +
scale_x_continuous(breaks = 3:14, minor_breaks = NULL) +
scale_color_viridis_d(begin = 0.5, end = 0.8, option = "D") +
labs(x = "Order of Hilbert Curve", y = "Time in ms (log-scale)", color = "Package",
title = "This package is two orders of magnitute faster",
subtitle = glue("Comparison of {reps} repetitions across all orders."),
caption = "Order 14 means 268,435,456 coordinates")
```
### Useful example
We use the `eliasdabbas/search-engine-results-flights-tickets-keywords` data set on [Kaggle](https://www.kaggle.com/eliasdabbas/search-engine-results-flights-tickets-keywords) as an example for a simple analysis. We map the full search URLs to the Hilbert curve and then add points when the URL was present for a specific search term. By comparing resulting facets we can see systematic difference in which provides show up for which search term.
```{r example_flt, echo=FALSE, message=FALSE, warning=FALSE}
#library(kaggler)
#kgl_auth() # requires a kaggle token.
data_set <- "eliasdabbas/search-engine-results-flights-tickets-keywords"
#refs <- kgl_datasets_list(owner_dataset = data_set)
#refs$datasetFiles
#kgl_datasets_download(owner_dataset = data_set,
# fileName = refs$datasetFiles$name[1], datasetVersionNumber = 14)
read_plus <- function(flnm) {
read_csv(flnm) %>%
mutate(filename = flnm)
}
# files must be added in the my_tests folder for this section to work.
df_with_sources <-
list.files(path = "my_tests",
pattern = "*.csv",
full.names = T) %>%
map_df(~read_plus(.))
comparison <- df_with_sources$searchTerms %>% unique() %>% head(4)
df_with_sources %>%
create_id_column(link) %>%
filter(searchTerms %in% comparison) %>%
gghilbertplot(gghid,
color = displayLink,
size = rank,
label = displayLink,
alpha = 0.1,
jitter = 0.1,
add_curve = F,
curve_alpha = 0.1) +
aes(group = displayLink) +
facet_wrap(~searchTerms) +
labs(title = "Comparsion of domains shown for different search queries") +
guides(color = FALSE) -> p
p
#plotly::ggplotly(p)
```
Link: https://www.kaggle.com/eliasdabbas/search-engine-results-flights-tickets-keywords under License CC0