
add exercise scripts for participants
stephhazlitt committed Aug 9, 2024
1 parent aa439f8 commit bd50467
Showing 6 changed files with 332 additions and 0 deletions.
1 change: 1 addition & 0 deletions _quarto.yaml
@@ -2,6 +2,7 @@ project:
  type: website
  render:
    - "*.qmd"
    - "!exercises/"
  execute-dir: project

website:
50 changes: 50 additions & 0 deletions exercises/1_hello_arrow_exercises-participants.qmd
@@ -0,0 +1,50 @@
---
title: "Hello Arrow Exercises"
execute:
  echo: true
  message: false
  warning: false
  cache: true
---

```{r}
#| label: load-packages
library(arrow)
library(dplyr)
library(tictoc)
```


```{r}
#| label: open-dataset
nyc_taxi <- open_dataset("data/nyc-taxi")
nyc_taxi
```


## First dplyr pipeline with Arrow

## Problems

1. Calculate the longest trip distance for every month in 2019

2. How long did this query take to run?

## Solution 1

Longest trip distance for every month in 2019:

```{r}
#| label: first-dplyr-exercise1
```
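
One possible sketch (not evaluated here): it assumes the dataset exposes `year` and `month` partition columns and a `trip_distance` column, which is how the NYC Taxi data is commonly laid out.

```{r}
#| eval: false
nyc_taxi |>
  filter(year == 2019) |>
  group_by(month) |>
  summarise(longest_trip = max(trip_distance, na.rm = TRUE)) |>
  arrange(month) |>
  collect()
```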

## Solution 2

Compute time:

```{r}
#| label: first-dplyr-exercise2
```
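
A sketch of timing the same query with `tictoc` (loaded above); same column assumptions as Solution 1.

```{r}
#| eval: false
tic()
nyc_taxi |>
  filter(year == 2019) |>
  group_by(month) |>
  summarise(longest_trip = max(trip_distance, na.rm = TRUE)) |>
  arrange(month) |>
  collect()
toc()
```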

73 changes: 73 additions & 0 deletions exercises/2_data_manipulation_1_exercises-participants.qmd
@@ -0,0 +1,73 @@
---
title: "Data Manipulation Part 1 - Exercises"
execute:
  echo: true
  message: false
  warning: false
  cache: true
---

```{r}
#| label: load-packages
library(arrow)
library(dplyr)
library(stringr)
```

```{r}
#| label: open-dataset
nyc_taxi <- open_dataset("data/nyc-taxi")
nyc_taxi
```


# Using `collect()` to run a query

## Problems

Use the function `nrow()` to work out the answers to these questions:

1. How many taxi fares in the dataset had a total amount greater than $100?


## Solution 1

```{r}
#| label: collect-1
```
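
One possible sketch, assuming the fare column is named `total_amount`: run the query with `collect()` and count the returned rows with `nrow()`.

```{r}
#| eval: false
nyc_taxi |>
  filter(total_amount > 100) |>
  select(total_amount) |>
  collect() |>
  nrow()
```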



# Using the dplyr API in arrow

## Problems

1. Use the `dplyr::filter()` and `stringr::str_ends()` functions to return a subset of the data which is a) from September 2020, and b) the value in `vendor_name` ends with the letter "S".

2. Try to use the `stringr` function `str_replace_na()` to replace any `NA` values in the `vendor_name` column with the string "No vendor" instead. What happens, and why?

3. Bonus question: see if you can find a different way of completing the task in question 2.

## Solution 1

```{r}
#| label: collect-sol1
```
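
A possible sketch, assuming `year` and `month` partition columns as in the NYC Taxi dataset:

```{r}
#| eval: false
nyc_taxi |>
  filter(year == 2020, month == 9) |>
  filter(str_ends(vendor_name, "S")) |>
  collect()
```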

## Solution 2

```{r}
#| label: collect-sol2
```
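
A sketch of the attempt. Because arrow has no translation for `str_replace_na()`, the expression cannot be run by the Arrow engine, so the query typically fails at `collect()` (depending on the arrow version and whether the data is already an in-memory table, it may instead fall back to R with a warning).

```{r}
#| eval: false
nyc_taxi |>
  filter(year == 2020, month == 9) |>
  mutate(vendor_name = str_replace_na(vendor_name, "No vendor")) |>
  collect()
```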

## Solution 3

```{r}
#| label: collect-sol3
```
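
One alternative sketch: `coalesce()` does have an arrow binding, so the replacement can stay inside the Arrow engine.

```{r}
#| eval: false
nyc_taxi |>
  filter(year == 2020, month == 9) |>
  mutate(vendor_name = coalesce(vendor_name, "No vendor")) |>
  collect()
```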


120 changes: 120 additions & 0 deletions exercises/3_data_engineering_exercises-participants.qmd
@@ -0,0 +1,120 @@
---
title: "Data Engineering with Arrow Exercises"
execute:
  echo: true
  message: false
  warning: false
  cache: true
---

# Data Types & Controlling the Schema

```{r}
#| label: load-packages
library(arrow)
library(dplyr)
library(tictoc)
```

```{r}
#| label: open-dataset-seattle-csv
seattle_csv <- open_dataset(
  sources = "data/seattle-library-checkouts.csv",
  format = "csv"
)
```


## Problems

1. The first few thousand rows of `ISBN` are blank in the Seattle Checkouts CSV file. Read in the Seattle Checkouts CSV file with `open_dataset()` and ensure the correct data type for `ISBN` is `<string>` instead of the `<null>` interpreted by Arrow.

2. Once you have a `Dataset` object with the metadata you are after, count the number of `Checkouts` by `CheckoutYear` and arrange the result by `CheckoutYear`.

## Solution 1

```{r}
#| label: seattle-csv-schema-1
```
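
One possible sketch: re-open the CSV and override only the `ISBN` column via `col_types`, leaving the rest of the schema to be inferred.

```{r}
#| eval: false
seattle_csv <- open_dataset(
  sources = "data/seattle-library-checkouts.csv",
  format = "csv",
  col_types = schema(ISBN = string())
)
seattle_csv
```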

## Solution 2

The number of `Checkouts` by `CheckoutYear` arranged by `CheckoutYear`:

```{r}
#| label: seattle-csv-dplyr-1
```
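
A sketch of the count, assuming `Checkouts` holds a per-row checkout count that should be summed within each `CheckoutYear`:

```{r}
#| eval: false
seattle_csv |>
  group_by(CheckoutYear) |>
  summarise(total_checkouts = sum(Checkouts)) |>
  arrange(CheckoutYear) |>
  collect()
```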


# Parquet

```{r}
#| label: write-dataset-seattle-parquet
#| eval: false
seattle_parquet <- "data/seattle-library-checkouts-parquet"
seattle_csv |>
  write_dataset(path = seattle_parquet,
                format = "parquet")
```


## Problem

1. Re-run the query counting the number of `Checkouts` by `CheckoutYear` and arranging the result by `CheckoutYear`, this time using the Seattle Checkout data saved to disk as a single Parquet file. Did you notice a difference in compute time?

## Solution 1

```{r}
#| label: seattle-parquet-dplyr-1
```
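
A sketch that opens the single Parquet file written above and repeats the query, wrapped in `tic()`/`toc()` so the compute time can be compared with the CSV-backed run:

```{r}
#| eval: false
tic()
open_dataset(sources = seattle_parquet, format = "parquet") |>
  group_by(CheckoutYear) |>
  summarise(total_checkouts = sum(Checkouts)) |>
  arrange(CheckoutYear) |>
  collect()
toc()
```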


# Partitioning

```{r}
#| label: write-dataset-seattle-partitioned
#| eval: false
seattle_parquet_part <- "data/seattle-library-checkouts"
seattle_csv |>
  group_by(CheckoutYear) |>
  write_dataset(path = seattle_parquet_part,
                format = "parquet")
```


## Problems

1. Let's write the Seattle Checkout CSV data to a multi-file dataset just one more time! This time, write the data partitioned by `CheckoutType` as Parquet files.

2. Now compare the compute time between our Parquet data partitioned by `CheckoutYear` and our Parquet data partitioned by `CheckoutType` with a query of the total number of checkouts in September of 2019. Did you find a difference in compute time?

## Solution 1

Writing the data:

```{r}
#| label: write-dataset-seattle-checkouttype
```
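
A sketch mirroring the `CheckoutYear` example above; the output path `data/seattle-library-checkouts-type` is just an illustrative choice.

```{r}
#| eval: false
seattle_checkouttype <- "data/seattle-library-checkouts-type"
seattle_csv |>
  group_by(CheckoutType) |>
  write_dataset(path = seattle_checkouttype,
                format = "parquet")
```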

## Solution 2

Total number of Checkouts in September of 2019 using Parquet data partitioned by `CheckoutType`:

```{r}
#| label: seattle-partitioned-other-dplyr
```
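
A sketch of the query against the `CheckoutType`-partitioned files (path as assumed in Solution 1). The filter columns are not the partition column here, so every partition has to be scanned.

```{r}
#| eval: false
tic()
open_dataset("data/seattle-library-checkouts-type") |>
  filter(CheckoutYear == 2019, CheckoutMonth == 9) |>
  summarise(total_checkouts = sum(Checkouts)) |>
  collect()
toc()
```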

Total number of Checkouts in September of 2019 using Parquet data partitioned by `CheckoutYear` and `CheckoutMonth`:

```{r}
#| label: seattle-partitioned-partitioned-filter-dplyr
```
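
The same sketch against `data/seattle-library-checkouts`, where the `CheckoutYear` filter can prune whole partition directories before any data is read.

```{r}
#| eval: false
tic()
open_dataset("data/seattle-library-checkouts") |>
  filter(CheckoutYear == 2019, CheckoutMonth == 9) |>
  summarise(total_checkouts = sum(Checkouts)) |>
  collect()
toc()
```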

50 changes: 50 additions & 0 deletions exercises/4_data_manipulation_2_exercises-participants.qmd
@@ -0,0 +1,50 @@
---
title: "Data Manipulation Part 2 - Exercises"
execute:
  echo: true
  message: false
  warning: false
  cache: true
---

```{r}
#| label: load-packages
library(arrow)
library(dplyr)
```

```{r}
#| label: open-dataset
nyc_taxi <- open_dataset("data/nyc-taxi")
nyc_taxi
```


# User-defined functions

## Problem

1. Write a user-defined function which wraps the `stringr` function `str_replace_na()`, and use it to replace any `NA` values in the `vendor_name` column with the string "No vendor" instead. (Test it on the data from 2019 so you're not pulling everything into memory.)

## Solution 1

```{r}
#| label: udf-solution
```
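
A sketch using `arrow::register_scalar_function()`. The function name `replace_vendor_na` is just an illustrative choice, and `year` is assumed to be a partition column of the dataset.

```{r}
#| eval: false
register_scalar_function(
  name = "replace_vendor_na",
  function(context, x) stringr::str_replace_na(x, "No vendor"),
  in_type = utf8(),
  out_type = utf8(),
  auto_convert = TRUE
)

nyc_taxi |>
  filter(year == 2019) |>
  mutate(vendor_name = replace_vendor_na(vendor_name)) |>
  distinct(vendor_name) |>
  collect()
```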


# Joins

## Problem

1. How many taxi pickups were recorded in 2019 from the three major airports covered by the NYC Taxi dataset (JFK, LaGuardia, Newark)? (Hint: you can use `stringr::str_detect()` to help you find pickup zones with the word "Airport" in them)

## Solution 1

```{r}
#| label: airport-pickup
```
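
A sketch under the assumption that pickups are stored as a `pickup_location_id` and that a taxi zone lookup CSV (here `data/taxi_zone_lookup.csv`, a hypothetical path) maps `LocationID` to a `Zone` name:

```{r}
#| eval: false
library(stringr)

# read the zone lookup into memory, keeping only the join key and zone name
pickup_zones <- read_csv_arrow("data/taxi_zone_lookup.csv") |>
  select(pickup_location_id = LocationID, pickup_zone = Zone)

nyc_taxi |>
  filter(year == 2019) |>
  left_join(pickup_zones, by = "pickup_location_id") |>
  filter(str_detect(pickup_zone, "Airport")) |>
  count(pickup_zone) |>
  collect()
```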


38 changes: 38 additions & 0 deletions exercises/5_arrow_single_file_exercises-participants.qmd
@@ -0,0 +1,38 @@
---
title: "Arrow In-Memory Exercise"
execute:
  echo: true
  message: false
  warning: false
  cache: true
---

```{r}
#| label: load-packages
library(arrow)
library(dplyr)
```


## Arrow Table

## Problem

1. Read in a single NYC Taxi Parquet file as an Arrow Table using `read_parquet()`

2. Convert your Arrow Table object to a `data.frame` or a `tibble`

## Solution 1

```{r}
#| label: arrow-table-read
```
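
A sketch; the file path is a hypothetical example of one Parquet file inside `data/nyc-taxi`. Setting `as_data_frame = FALSE` makes `read_parquet()` return an Arrow Table instead of a tibble.

```{r}
#| eval: false
taxi_table <- read_parquet(
  "data/nyc-taxi/year=2019/month=9/part-0.parquet",
  as_data_frame = FALSE
)
taxi_table
```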

## Solution 2

```{r}
#| label: table-to-tibble
```
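
Either base R or dplyr coercion works on an Arrow Table; both pull the data into memory.

```{r}
#| eval: false
taxi_tibble <- dplyr::as_tibble(taxi_table)
taxi_df <- as.data.frame(taxi_table)
```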
