diff --git a/_quarto.yaml b/_quarto.yaml
index 8d06705..612348c 100644
--- a/_quarto.yaml
+++ b/_quarto.yaml
@@ -2,6 +2,7 @@ project:
   type: website
   render:
     - "*.qmd"
+    - "!exercises/"
   execute-dir: project
 
 website:
diff --git a/exercises/1_hello_arrow_exercises-participants.qmd b/exercises/1_hello_arrow_exercises-participants.qmd
new file mode 100644
index 0000000..c3bee5f
--- /dev/null
+++ b/exercises/1_hello_arrow_exercises-participants.qmd
@@ -0,0 +1,50 @@
+---
+title: "Hello Arrow Exercises"
+execute:
+  echo: true
+  message: false
+  warning: false
+  cache: true
+---
+
+```{r}
+#| label: load-packages
+library(arrow)
+library(dplyr)
+library(tictoc)
+```
+
+
+```{r}
+#| label: open-dataset
+nyc_taxi <- open_dataset("data/nyc-taxi")
+nyc_taxi
+```
+
+
+## First dplyr pipeline with Arrow
+
+## Problems
+
+1. Calculate the longest trip distance for every month in 2019
+
+2. How long did this query take to run?
+
+## Solution 1
+
+Longest trip distance for every month in 2019:
+
+```{r}
+#| label: first-dplyr-exercise1
+
+```
+
+## Solution 2
+
+Compute time:
+
+```{r}
+#| label: first-dplyr-exercise2
+
+```
+
diff --git a/exercises/2_data_manipulation_1_exercises-participants.qmd b/exercises/2_data_manipulation_1_exercises-participants.qmd
new file mode 100644
index 0000000..8cef11a
--- /dev/null
+++ b/exercises/2_data_manipulation_1_exercises-participants.qmd
@@ -0,0 +1,73 @@
+---
+title: "Data Manipulation Part 1 - Exercises"
+execute:
+  echo: true
+  message: false
+  warning: false
+  cache: true
+---
+
+```{r}
+#| label: load-packages
+library(arrow)
+library(dplyr)
+library(stringr)
+```
+
+```{r}
+#| label: open-dataset
+nyc_taxi <- open_dataset("data/nyc-taxi")
+nyc_taxi
+```
+
+
+# Using `collect()` to run a query
+
+## Problems
+
+Use the function `nrow()` to work out the answers to these questions:
+
+1. How many taxi fares in the dataset had a total amount greater than $100?
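+
+One possible approach (a sketch, not the only valid answer): `nrow()` on an Arrow query triggers the computation without pulling the full dataset into memory.
+
+```{r}
+#| label: collect-1-sketch
+#| eval: false
+nyc_taxi |>
+  filter(total_amount > 100) |>
+  nrow()
+```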
+
+
+## Solution 1
+
+```{r}
+#| label: collect-1
+
+```
+
+
+
+# Using the dplyr API in arrow
+
+## Problems
+
+1. Use the `dplyr::filter()` and `stringr::str_ends()` functions to return a subset of the data which is a) from September 2020, and b) the value in `vendor_name` ends with the letter "S".
+
+2. Try to use the `stringr` function `str_replace_na()` to replace any `NA` values in the `vendor_name` column with the string "No vendor" instead. What happens, and why?
+
+3. Bonus question: see if you can find a different way of completing the task in question 2.
+
+## Solution 1
+
+```{r}
+#| label: collect-sol1
+
+```
+
+## Solution 2
+
+```{r}
+#| label: collect-sol2
+
+```
+
+## Solution 3
+
+```{r}
+#| label: collect-sol3
+
+```
+
+
diff --git a/exercises/3_data_engineering_exercises-participants.qmd b/exercises/3_data_engineering_exercises-participants.qmd
new file mode 100644
index 0000000..41d30be
--- /dev/null
+++ b/exercises/3_data_engineering_exercises-participants.qmd
@@ -0,0 +1,120 @@
+---
+title: "Data Engineering with Arrow Exercises"
+execute:
+  echo: true
+  message: false
+  warning: false
+  cache: true
+---
+
+# Data Types & Controlling the Schema
+
+```{r}
+#| label: load-packages
+library(arrow)
+library(dplyr)
+library(tictoc)
+```
+
+```{r}
+#| label: open-dataset-seattle-csv
+seattle_csv <- open_dataset(
+  sources = "data/seattle-library-checkouts.csv",
+  format = "csv"
+)
+```
+
+
+## Problems
+
+1. The first few thousand rows of `ISBN` are blank in the Seattle Checkouts CSV file. Read in the Seattle Checkouts CSV file with `open_dataset()` and ensure the correct data type for `ISBN` is `<string>` instead of the `<null>` interpreted by Arrow.
+
+2. Once you have a `Dataset` object with the metadata you are after, count the number of `Checkouts` by `CheckoutYear` and arrange the result by `CheckoutYear`.
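+
+A sketch of one way to approach both parts: supply a full schema (which requires `skip = 1` so the header row is not read as data). The column names below are assumed from the Seattle checkouts data — check them against `seattle_csv$schema` before relying on this.
+
+```{r}
+#| label: seattle-csv-schema-sketch
+#| eval: false
+seattle_csv <- open_dataset(
+  sources = "data/seattle-library-checkouts.csv",
+  format = "csv",
+  skip = 1,
+  schema(
+    UsageClass = utf8(),
+    CheckoutType = utf8(),
+    MaterialType = utf8(),
+    CheckoutYear = int64(),
+    CheckoutMonth = int64(),
+    Checkouts = int64(),
+    Title = utf8(),
+    ISBN = utf8(),
+    Creator = utf8(),
+    Subjects = utf8(),
+    Publisher = utf8(),
+    PublicationYear = utf8()
+  )
+)
+
+seattle_csv |>
+  count(CheckoutYear) |>
+  arrange(CheckoutYear) |>
+  collect()
+```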
+
+## Solution 1
+
+```{r}
+#| label: seattle-csv-schema-1
+
+```
+
+## Solution 2
+
+The number of `Checkouts` by `CheckoutYear` arranged by `CheckoutYear`:
+
+```{r}
+#| label: seattle-csv-dplyr-1
+
+```
+
+
+# Parquet
+
+```{r}
+#| label: write-dataset-seattle-parquet
+#| eval: false
+seattle_parquet <- "data/seattle-library-checkouts-parquet"
+
+seattle_csv |>
+  write_dataset(path = seattle_parquet,
+                format = "parquet")
+```
+
+
+## Problem
+
+1. Re-run the query counting the number of `Checkouts` by `CheckoutYear` and arranging the result by `CheckoutYear`, this time using the Seattle Checkout data saved to disk as a single Parquet file. Did you notice a difference in compute time?
+
+## Solution 1
+
+```{r}
+#| label: seattle-parquet-dplyr-1
+
+```
+
+
+# Partitioning
+
+```{r}
+#| label: write-dataset-seattle-partitioned
+#| eval: false
+seattle_parquet_part <- "data/seattle-library-checkouts"
+
+seattle_csv |>
+  group_by(CheckoutYear) |>
+  write_dataset(path = seattle_parquet_part,
+                format = "parquet")
+```
+
+
+## Problems
+
+1. Let's write the Seattle Checkout CSV data to a multi-file dataset just one more time! This time, write the data partitioned by `CheckoutType` as Parquet files.
+
+2. Now compare the compute time between our Parquet data partitioned by `CheckoutYear` and our Parquet data partitioned by `CheckoutType` with a query of the total number of checkouts in September of 2019. Did you find a difference in compute time?
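+
+A sketch of one way to approach both parts — the output path is illustrative, and `tic()`/`toc()` from tictoc (loaded above) gives a rough timing; run the same timed query against each partitioning scheme to compare:
+
+```{r}
+#| label: seattle-partition-sketch
+#| eval: false
+# Write the data partitioned by CheckoutType
+seattle_csv |>
+  group_by(CheckoutType) |>
+  write_dataset(path = "data/seattle-library-checkouts-type",
+                format = "parquet")
+
+# Time the September 2019 query against this partitioning
+tic()
+open_dataset("data/seattle-library-checkouts-type") |>
+  filter(CheckoutYear == 2019, CheckoutMonth == 9) |>
+  summarise(TotalCheckouts = sum(Checkouts)) |>
+  collect()
+toc()
+```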
+
+## Solution 1
+
+Writing the data:
+
+```{r}
+#| label: write-dataset-seattle-checkouttype
+
+```
+
+## Solution 2
+
+Total number of Checkouts in September of 2019 using partitioned Parquet data by `CheckoutType`:
+
+```{r}
+#| label: seattle-partitioned-other-dplyr
+
+```
+
+Total number of Checkouts in September of 2019 using partitioned Parquet data by `CheckoutYear` and `CheckoutMonth`:
+
+```{r}
+#| label: seattle-partitioned-partitioned-filter-dplyr
+
+```
+
diff --git a/exercises/4_data_manipulation_2_exercises-participants.qmd b/exercises/4_data_manipulation_2_exercises-participants.qmd
new file mode 100644
index 0000000..24543c5
--- /dev/null
+++ b/exercises/4_data_manipulation_2_exercises-participants.qmd
@@ -0,0 +1,50 @@
+---
+title: "Data Manipulation Part 2 - Exercises"
+execute:
+  echo: true
+  message: false
+  warning: false
+  cache: true
+---
+
+```{r}
+#| label: load-packages
+library(arrow)
+library(dplyr)
+```
+
+```{r}
+#| label: open-dataset
+nyc_taxi <- open_dataset("data/nyc-taxi")
+nyc_taxi
+```
+
+
+# User-defined functions
+
+## Problem
+
+1. Write a user-defined function which wraps the `stringr` function `str_replace_na()`, and use it to replace any `NA` values in the `vendor_name` column with the string "No vendor" instead. (Test it on the data from 2019 so you're not pulling everything into memory.)
+
+## Solution 1
+
+```{r}
+#| label: udf-solution
+
+```
+
+
+# Joins
+
+## Problem
+
+1. How many taxi pickups were recorded in 2019 from the three major airports covered by the NYC Taxis data set (JFK, LaGuardia, Newark)? (Hint: you can use `stringr::str_detect()` to help you find pickup zones with the word "Airport" in them.)
+
+## Solution 1
+
+```{r}
+#| label: airport-pickup
+
+```
+
+
diff --git a/exercises/5_arrow_single_file_exercises-participants.qmd b/exercises/5_arrow_single_file_exercises-participants.qmd
new file mode 100644
index 0000000..2c52369
--- /dev/null
+++ b/exercises/5_arrow_single_file_exercises-participants.qmd
@@ -0,0 +1,38 @@
+---
+title: "Arrow In-Memory Exercise"
+execute:
+  echo: true
+  message: false
+  warning: false
+  cache: true
+---
+
+```{r}
+#| label: load-packages
+library(arrow)
+library(dplyr)
+```
+
+
+## Arrow Table
+
+## Problem
+
+1. Read in a single NYC Taxi parquet file using `read_parquet()` as an Arrow Table
+
+2. Convert your Arrow Table object to a `data.frame` or a `tibble`
+
+## Solution 1
+
+```{r}
+#| label: arrow-table-read
+
+```
+
+## Solution 2
+
+```{r}
+#| label: table-to-tibble
+
+```
+
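+
+A sketch covering both steps — the parquet path below is a guess at the dataset layout; point it at any single file under `data/nyc-taxi/`. Passing `as_data_frame = FALSE` makes `read_parquet()` return an Arrow Table rather than a tibble:
+
+```{r}
+#| label: arrow-table-sketch
+#| eval: false
+taxi_table <- read_parquet(
+  "data/nyc-taxi/year=2019/month=9/part-0.parquet",
+  as_data_frame = FALSE
+)
+taxi_table                    # an Arrow Table
+
+taxi_df <- dplyr::collect(taxi_table)   # or as.data.frame(taxi_table)
+taxi_df
+```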