
add exercise scripts for participants
stephhazlitt committed Aug 9, 2024
1 parent aa439f8 commit bd50467
Showing 6 changed files with 332 additions and 0 deletions.
1 change: 1 addition & 0 deletions _quarto.yaml
@@ -2,6 +2,7 @@ project:
  type: website
  render:
    - "*.qmd"
    - "!exercises/"
  execute-dir: project

website:
50 changes: 50 additions & 0 deletions exercises/1_hello_arrow_exercises-participants.qmd
@@ -0,0 +1,50 @@
---
title: "Hello Arrow Exercises"
execute:
  echo: true
  message: false
  warning: false
  cache: true
---

```{r}
#| label: load-packages
library(arrow)
library(dplyr)
library(tictoc)
```


```{r}
#| label: open-dataset
nyc_taxi <- open_dataset("data/nyc-taxi")
nyc_taxi
```


## First dplyr pipeline with Arrow

## Problems

1. Calculate the longest trip distance for every month in 2019

2. How long did this query take to run?

## Solution 1

Longest trip distance for every month in 2019:

```{r}
#| label: first-dplyr-exercise1
```
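
One possible sketch (not evaluated here): it assumes the dataset exposes `year` and `month` partition columns and a `trip_distance` column, which is how the NYC Taxi data is commonly laid out.

```{r}
#| eval: false
nyc_taxi |>
  filter(year == 2019) |>
  group_by(month) |>
  summarise(longest_trip = max(trip_distance, na.rm = TRUE)) |>
  arrange(month) |>
  collect()
```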

## Solution 2

Compute time:

```{r}
#| label: first-dplyr-exercise2
```
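
A sketch of timing the same query with `tictoc` (loaded above); same column assumptions as Solution 1.

```{r}
#| eval: false
tic()
nyc_taxi |>
  filter(year == 2019) |>
  group_by(month) |>
  summarise(longest_trip = max(trip_distance, na.rm = TRUE)) |>
  arrange(month) |>
  collect()
toc()
```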

73 changes: 73 additions & 0 deletions exercises/2_data_manipulation_1_exercises-participants.qmd
@@ -0,0 +1,73 @@
---
title: "Data Manipulation Part 1 - Exercises"
execute:
  echo: true
  message: false
  warning: false
  cache: true
---

```{r}
#| label: load-packages
library(arrow)
library(dplyr)
library(stringr)
```

```{r}
#| label: open-dataset
nyc_taxi <- open_dataset("data/nyc-taxi")
nyc_taxi
```


# Using `collect()` to run a query

## Problems

Use the function `nrow()` to work out the answers to these questions:

1. How many taxi fares in the dataset had a total amount greater than $100?


## Solution 1

```{r}
#| label: collect-1
```
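
One possible sketch, assuming the fare column is named `total_amount`: run the query with `collect()` and count the returned rows with `nrow()`.

```{r}
#| eval: false
nyc_taxi |>
  filter(total_amount > 100) |>
  select(total_amount) |>
  collect() |>
  nrow()
```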



# Using the dplyr API in arrow

## Problems

1. Use the `dplyr::filter()` and `stringr::str_ends()` functions to return a subset of the data which is a) from September 2020, and b) the value in `vendor_name` ends with the letter "S".

2. Try to use the `stringr` function `str_replace_na()` to replace any `NA` values in the `vendor_name` column with the string "No vendor" instead. What happens, and why?

3. Bonus question: see if you can find a different way of completing the task in question 2.

## Solution 1

```{r}
#| label: collect-sol1
```
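
A possible sketch, assuming `year` and `month` partition columns as in the NYC Taxi dataset:

```{r}
#| eval: false
nyc_taxi |>
  filter(year == 2020, month == 9) |>
  filter(str_ends(vendor_name, "S")) |>
  collect()
```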

## Solution 2

```{r}
#| label: collect-sol2
```
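
A sketch of the attempt. Because arrow has no translation for `str_replace_na()`, the expression cannot be run by the Arrow engine, so the query typically fails at `collect()` (depending on the arrow version and whether the data is already an in-memory table, it may instead fall back to R with a warning).

```{r}
#| eval: false
nyc_taxi |>
  filter(year == 2020, month == 9) |>
  mutate(vendor_name = str_replace_na(vendor_name, "No vendor")) |>
  collect()
```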

## Solution 3

```{r}
#| label: collect-sol3
```
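
One alternative sketch: `coalesce()` does have an arrow binding, so the replacement can stay inside the Arrow engine.

```{r}
#| eval: false
nyc_taxi |>
  filter(year == 2020, month == 9) |>
  mutate(vendor_name = coalesce(vendor_name, "No vendor")) |>
  collect()
```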


120 changes: 120 additions & 0 deletions exercises/3_data_engineering_exercises-participants.qmd
@@ -0,0 +1,120 @@
---
title: "Data Engineering with Arrow Exercises"
execute:
  echo: true
  message: false
  warning: false
  cache: true
---

# Data Types & Controlling the Schema

```{r}
#| label: load-packages
library(arrow)
library(dplyr)
library(tictoc)
```

```{r}
#| label: open-dataset-seattle-csv
seattle_csv <- open_dataset(
  sources = "data/seattle-library-checkouts.csv",
  format = "csv"
)
```


## Problems

1. The first few thousand rows of `ISBN` are blank in the Seattle Checkouts CSV file. Read in the Seattle Checkouts CSV file with `open_dataset()` and ensure the correct data type for `ISBN` is `<string>` instead of the `<null>` interpreted by Arrow.

2. Once you have a `Dataset` object with the metadata you are after, count the number of `Checkouts` by `CheckoutYear` and arrange the result by `CheckoutYear`.

## Solution 1

```{r}
#| label: seattle-csv-schema-1
```
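
One possible sketch: re-open the CSV and override only the `ISBN` column via `col_types`, leaving the rest of the schema to be inferred.

```{r}
#| eval: false
seattle_csv <- open_dataset(
  sources = "data/seattle-library-checkouts.csv",
  format = "csv",
  col_types = schema(ISBN = string())
)
seattle_csv
```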

## Solution 2

The number of `Checkouts` by `CheckoutYear` arranged by `CheckoutYear`:

```{r}
#| label: seattle-csv-dplyr-1
```
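
A sketch of the count, assuming `Checkouts` holds a per-row checkout count that should be summed within each `CheckoutYear`:

```{r}
#| eval: false
seattle_csv |>
  group_by(CheckoutYear) |>
  summarise(total_checkouts = sum(Checkouts)) |>
  arrange(CheckoutYear) |>
  collect()
```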


# Parquet

```{r}
#| label: write-dataset-seattle-parquet
#| eval: false
seattle_parquet <- "data/seattle-library-checkouts-parquet"
seattle_csv |>
  write_dataset(path = seattle_parquet,
                format = "parquet")
```


## Problem

1. Re-run the query counting the number of `Checkouts` by `CheckoutYear` and arranging the result by `CheckoutYear`, this time using the Seattle Checkout data saved to disk as a single Parquet file. Did you notice a difference in compute time?

## Solution 1

```{r}
#| label: seattle-parquet-dplyr-1
```
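
A sketch that opens the single Parquet file written above and repeats the query, wrapped in `tic()`/`toc()` so the compute time can be compared with the CSV-backed run:

```{r}
#| eval: false
tic()
open_dataset(sources = seattle_parquet, format = "parquet") |>
  group_by(CheckoutYear) |>
  summarise(total_checkouts = sum(Checkouts)) |>
  arrange(CheckoutYear) |>
  collect()
toc()
```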


# Partitioning

```{r}
#| label: write-dataset-seattle-partitioned
#| eval: false
seattle_parquet_part <- "data/seattle-library-checkouts"
seattle_csv |>
  group_by(CheckoutYear) |>
  write_dataset(path = seattle_parquet_part,
                format = "parquet")
```


## Problems

1. Let's write the Seattle Checkout CSV data to a multi-file dataset just one more time! This time, write the data partitioned by `CheckoutType` as Parquet files.

2. Now compare the compute time between our Parquet data partitioned by `CheckoutYear` and our Parquet data partitioned by `CheckoutType` with a query of the total number of checkouts in September of 2019. Did you find a difference in compute time?

## Solution 1

Writing the data:

```{r}
#| label: write-dataset-seattle-checkouttype
```
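
A sketch mirroring the `CheckoutYear` example above; the output path `data/seattle-library-checkouts-type` is just an illustrative choice.

```{r}
#| eval: false
seattle_checkouttype <- "data/seattle-library-checkouts-type"
seattle_csv |>
  group_by(CheckoutType) |>
  write_dataset(path = seattle_checkouttype,
                format = "parquet")
```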

## Solution 2

Total number of Checkouts in September of 2019 using Parquet data partitioned by `CheckoutType`:

```{r}
#| label: seattle-partitioned-other-dplyr
```
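
A sketch of the query against the `CheckoutType`-partitioned files (path as assumed in Solution 1). The filter columns are not the partition column here, so every partition has to be scanned.

```{r}
#| eval: false
tic()
open_dataset("data/seattle-library-checkouts-type") |>
  filter(CheckoutYear == 2019, CheckoutMonth == 9) |>
  summarise(total_checkouts = sum(Checkouts)) |>
  collect()
toc()
```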

Total number of Checkouts in September of 2019 using Parquet data partitioned by `CheckoutYear` and `CheckoutMonth`:

```{r}
#| label: seattle-partitioned-partitioned-filter-dplyr
```
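
The same sketch against `data/seattle-library-checkouts`, where the `CheckoutYear` filter can prune whole partition directories before any data is read.

```{r}
#| eval: false
tic()
open_dataset("data/seattle-library-checkouts") |>
  filter(CheckoutYear == 2019, CheckoutMonth == 9) |>
  summarise(total_checkouts = sum(Checkouts)) |>
  collect()
toc()
```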

50 changes: 50 additions & 0 deletions exercises/4_data_manipulation_2_exercises-participants.qmd
@@ -0,0 +1,50 @@
---
title: "Data Manipulation Part 2 - Exercises"
execute:
  echo: true
  message: false
  warning: false
  cache: true
---

```{r}
#| label: load-packages
library(arrow)
library(dplyr)
```

```{r}
#| label: open-dataset
nyc_taxi <- open_dataset("data/nyc-taxi")
nyc_taxi
```


# User-defined functions

## Problem

1. Write a user-defined function which wraps the `stringr` function `str_replace_na()`, and use it to replace any `NA` values in the `vendor_name` column with the string "No vendor" instead. (Test it on the data from 2019 so you're not pulling everything into memory.)

## Solution 1

```{r}
#| label: udf-solution
```
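
A sketch using `arrow::register_scalar_function()`. The function name `replace_vendor_na` is just an illustrative choice, and `year` is assumed to be a partition column of the dataset.

```{r}
#| eval: false
register_scalar_function(
  name = "replace_vendor_na",
  function(context, x) stringr::str_replace_na(x, "No vendor"),
  in_type = utf8(),
  out_type = utf8(),
  auto_convert = TRUE
)

nyc_taxi |>
  filter(year == 2019) |>
  mutate(vendor_name = replace_vendor_na(vendor_name)) |>
  distinct(vendor_name) |>
  collect()
```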


# Joins

## Problem

1. How many taxi pickups were recorded in 2019 from the three major airports covered by the NYC Taxi dataset (JFK, LaGuardia, Newark)? (Hint: you can use `stringr::str_detect()` to help you find pickup zones with the word "Airport" in them)

## Solution 1

```{r}
#| label: airport-pickup
```
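
A sketch under the assumption that pickups are stored as a `pickup_location_id` and that a taxi zone lookup CSV (here `data/taxi_zone_lookup.csv`, a hypothetical path) maps `LocationID` to a `Zone` name:

```{r}
#| eval: false
library(stringr)

# read the zone lookup into memory, keeping only the join key and zone name
pickup_zones <- read_csv_arrow("data/taxi_zone_lookup.csv") |>
  select(pickup_location_id = LocationID, pickup_zone = Zone)

nyc_taxi |>
  filter(year == 2019) |>
  left_join(pickup_zones, by = "pickup_location_id") |>
  filter(str_detect(pickup_zone, "Airport")) |>
  count(pickup_zone) |>
  collect()
```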


38 changes: 38 additions & 0 deletions exercises/5_arrow_single_file_exercises-participants.qmd
@@ -0,0 +1,38 @@
---
title: "Arrow In-Memory Exercise"
execute:
  echo: true
  message: false
  warning: false
  cache: true
---

```{r}
#| label: load-packages
library(arrow)
library(dplyr)
```


## Arrow Table

## Problem

1. Read in a single NYC Taxi Parquet file as an Arrow Table using `read_parquet()`

2. Convert your Arrow Table object to a `data.frame` or a `tibble`

## Solution 1

```{r}
#| label: arrow-table-read
```
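
A sketch; the file path is a hypothetical example of one Parquet file inside `data/nyc-taxi`. Setting `as_data_frame = FALSE` makes `read_parquet()` return an Arrow Table instead of a tibble.

```{r}
#| eval: false
taxi_table <- read_parquet(
  "data/nyc-taxi/year=2019/month=9/part-0.parquet",
  as_data_frame = FALSE
)
taxi_table
```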

## Solution 2

```{r}
#| label: table-to-tibble
```
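
Either base R or dplyr coercion works on an Arrow Table; both pull the data into memory.

```{r}
#| eval: false
taxi_tibble <- dplyr::as_tibble(taxi_table)
taxi_df <- as.data.frame(taxi_table)
```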
