Commit
add exercise scripts for participants
1 parent aa439f8, commit bd50467
Showing 6 changed files with 332 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,6 +2,7 @@ project: | |
type: website | ||
render: | ||
- "*.qmd" | ||
- "!exercises/" | ||
execute-dir: project | ||
|
||
website: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,50 @@ | ||
--- | ||
title: "Hello Arrow Exercises" | ||
execute: | ||
echo: true | ||
messages: false | ||
warning: false | ||
cache: true | ||
--- | ||
|
||
```{r} | ||
#| label: load-packages | ||
library(arrow) | ||
library(dplyr) | ||
library(tictoc) | ||
``` | ||
|
||
|
||
```{r} | ||
#| label: open-dataset | ||
nyc_taxi <- open_dataset("data/nyc-taxi") | ||
nyc_taxi | ||
``` | ||
|
||
|
||
## First dplyr pipeline with Arrow | ||
|
||
## Problems | ||
|
||
1. Calculate the longest trip distance for every month in 2019 | ||
|
||
2. How long did this query take to run? | ||
|
||
## Solution 1 | ||
|
||
Longest trip distance for every month in 2019: | ||
|
||
```{r} | ||
#| label: first-dplyr-exercise1 | ||
``` | ||
|
||
## Solution 2 | ||
|
||
Compute time: | ||
|
||
```{r} | ||
#| label: first-dplyr-exercise2 | ||
``` | ||
|
exercises/2_data_manipulation_1_exercises-participants.qmd (73 changes: 73 additions & 0 deletions)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,73 @@ | ||
--- | ||
title: "Data Manipulation Part 1 - Exercises" | ||
execute: | ||
echo: true | ||
messages: false | ||
warning: false | ||
cache: true | ||
--- | ||
|
||
```{r} | ||
#| label: load-packages | ||
library(arrow) | ||
library(dplyr) | ||
library(stringr) | ||
``` | ||
|
||
```{r} | ||
#| label: open-dataset | ||
nyc_taxi <- open_dataset("data/nyc-taxi") | ||
nyc_taxi | ||
``` | ||
|
||
|
||
# Using `collect()` to run a query | ||
|
||
## Problems | ||
|
||
Use the function `nrow()` to work out the answers to these questions: | ||
|
||
1. How many taxi fares in the dataset had a total amount greater than $100? | ||
|
||
|
||
## Solution 1 | ||
|
||
```{r} | ||
#| label: collect-1 | ||
``` | ||
|
||
|
||
|
||
# Using the dplyr API in arrow | ||
|
||
## Problems | ||
|
||
1. Use the `dplyr::filter()` and `stringr::str_ends()` functions to return a subset of the data which is a) from September 2020, and b) the value in `vendor_name` ends with the letter "S". | ||
|
||
2. Try to use the `stringr` function `str_replace_na()` to replace any `NA` values in the `vendor_name` column with the string "No vendor" instead. What happens, and why? | ||
|
||
3. Bonus question: see if you can find a different way of completing the task in question 2. | ||
|
||
## Solution 1 | ||
|
||
```{r} | ||
#| label: collect-sol1 | ||
``` | ||
|
||
## Solution 2 | ||
|
||
```{r} | ||
#| label: collect-sol2 | ||
``` | ||
|
||
## Solution 3 | ||
|
||
```{r} | ||
#| label: collect-sol3 | ||
``` | ||
|
||
|
exercises/3_data_engineering_exercises-participants.qmd (120 changes: 120 additions & 0 deletions)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,120 @@ | ||
--- | ||
title: "Data Engineering with Arrow Exercises" | ||
execute: | ||
echo: true | ||
messages: false | ||
warning: false | ||
cache: true | ||
--- | ||
|
||
# Data Types & Controlling the Schema | ||
|
||
```{r} | ||
#| label: load-packages | ||
library(arrow) | ||
library(dplyr) | ||
library(tictoc) | ||
``` | ||
|
||
```{r} | ||
#| label: open-dataset-seattle-csv | ||
seattle_csv <- open_dataset( | ||
sources = "data/seattle-library-checkouts.csv", | ||
format = "csv" | ||
) | ||
``` | ||
|
||
|
||
## Problems | ||
|
||
1. The first few thousand rows of `ISBN` are blank in the Seattle Checkouts CSV file. Read in the Seattle Checkouts CSV file with `open_dataset()` and ensure the correct data type for `ISBN` is `<string>` instead of the `<null>` interpreted by Arrow. | ||
|
||
2. Once you have a `Dataset` object with the metadata you are after, count the number of `Checkouts` by `CheckoutYear` and arrange the result by `CheckoutYear`. | ||
|
||
## Solution 1 | ||
|
||
```{r} | ||
#| label: seattle-csv-schema-1 | ||
``` | ||
|
||
## Solution 2 | ||
|
||
The number of `Checkouts` by `CheckoutYear` arranged by `CheckoutYear`: | ||
|
||
```{r} | ||
#| label: seattle-csv-dplyr-1 | ||
``` | ||
|
||
|
||
# Parquet | ||
|
||
```{r} | ||
#| label: write-dataset-seattle-parquet | ||
#| eval: false | ||
seattle_parquet <- "data/seattle-library-checkouts-parquet" | ||
seattle_csv |> | ||
write_dataset(path = seattle_parquet, | ||
format = "parquet") | ||
``` | ||
|
||
|
||
## Problem | ||
|
||
1. Re-run the query counting the number of `Checkouts` by `CheckoutYear` and arranging the result by `CheckoutYear`, this time using the Seattle Checkout data saved to disk as a single, Parquet file. Did you notice a difference in compute time? | ||
|
||
## Solution 1 | ||
|
||
```{r} | ||
#| label: seattle-parquet-dplyr-1 | ||
``` | ||
|
||
|
||
# Partitioning | ||
|
||
```{r} | ||
#| label: write-dataset-seattle-partitioned | ||
#| eval: false | ||
seattle_parquet_part <- "data/seattle-library-checkouts" | ||
seattle_csv |> | ||
group_by(CheckoutYear) |> | ||
write_dataset(path = seattle_parquet_part, | ||
format = "parquet") | ||
``` | ||
|
||
|
||
## Problems | ||
|
||
1. Let's write the Seattle Checkout CSV data to a multi-file dataset just one more time! This time, write the data partitioned by `CheckoutType` as Parquet files. | ||
|
||
2. Now compare the compute time between our Parquet data partitioned by `CheckoutYear` and our Parquet data partitioned by `CheckoutType` with a query of the total number of checkouts in September of 2019. Did you find a difference in compute time? | ||
|
||
## Solution 1 | ||
|
||
Writing the data: | ||
|
||
```{r} | ||
#| label: write-dataset-seattle-checkouttype | ||
``` | ||
|
||
## Solution 2 | ||
|
||
Total number of Checkouts in September of 2019 using partitioned Parquet data by `CheckoutType`: | ||
|
||
```{r} | ||
#| label: seattle-partitioned-other-dplyr | ||
``` | ||
|
||
Total number of Checkouts in September of 2019 using partitioned Parquet data by `CheckoutYear` and `CheckoutMonth`: | ||
|
||
```{r} | ||
#| label: seattle-partitioned-partitioned-filter-dplyr | ||
``` | ||
|
exercises/4_data_manipulation_2_exercises-participants.qmd (50 changes: 50 additions & 0 deletions)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,50 @@ | ||
--- | ||
title: "Data Manipulation Part 2 - Exercises" | ||
execute: | ||
echo: true | ||
messages: false | ||
warning: false | ||
cache: true | ||
--- | ||
|
||
```{r} | ||
#| label: load-packages | ||
library(arrow) | ||
library(dplyr) | ||
``` | ||
|
||
```{r} | ||
#| label: open-dataset | ||
nyc_taxi <- open_dataset("data/nyc-taxi") | ||
nyc_taxi | ||
``` | ||
|
||
|
||
# User-defined functions | ||
|
||
## Problem | ||
|
||
1. Write a user-defined function which wraps the `stringr` function `str_replace_na()`, and use it to replace any `NA` values in the `vendor_name` column with the string "No vendor" instead. (Test it on the data from 2019 so you're not pulling everything into memory) | ||
|
||
## Solution 1 | ||
|
||
```{r} | ||
#| label: udf-solution | ||
``` | ||
|
||
|
||
# Joins | ||
|
||
## Problem | ||
|
||
1. How many taxi pickups were recorded in 2019 from the three major airports covered by the NYC Taxis data set (JFK, LaGuardia, Newark)? (Hint: you can use `stringr::str_detect()` to help you find pickup zones with the word "Airport" in them) | ||
|
||
## Solution 1 | ||
|
||
```{r} | ||
#| label: airport-pickup | ||
``` | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
--- | ||
title: "Arrow In-Memory Exercise" | ||
execute: | ||
echo: true | ||
messages: false | ||
warning: false | ||
cache: true | ||
--- | ||
|
||
```{r} | ||
#| label: load-packages | ||
library(arrow) | ||
library(dplyr) | ||
``` | ||
|
||
|
||
## Arrow Table | ||
|
||
## Problem | ||
|
||
1. Read in a single NYC Taxi parquet file using `read_parquet()` as an Arrow Table | ||
|
||
2. Convert your Arrow Table object to a `data.frame` or a `tibble` | ||
|
||
## Solution 1 | ||
|
||
```{r} | ||
#| label: arrow-table-read | ||
``` | ||
|
||
## Solution 2 | ||
|
||
```{r} | ||
#| label: table-to-tibble | ||
``` | ||
|