Move data loading to start of tutorial
Previously our data downloading/loading was intermingled throughout (and
sometimes duplicated in a few places). This change migrates all data
downloading and loading into the first notebook so that later
notebooks can focus on the content.
jcrist committed Jul 2, 2024
1 parent 50435bc commit 2d2586e
Showing 18 changed files with 143 additions and 383 deletions.
3 changes: 3 additions & 0 deletions .devcontainer/compose.yaml
@@ -19,13 +19,16 @@ services:
POSTGRES_PASSWORD: postgres
POSTGRES_DB: postgres
POSTGRES_USER: postgres
user: postgres
image: postgres:15
healthcheck:
interval: 1s
retries: 20
test:
- CMD
- pg_isready
ports:
- "5432:5432"
volumes:
- postgres:/var/lib/postgresql/data

65 changes: 4 additions & 61 deletions 00 - Start Here.ipynb
@@ -16,74 +16,17 @@
"1. [Playing with PyPI](./04%20-%20Playing%20with%20PyPI.ipynb)\n",
"\n",
"\n",
"First, let's kick off a download of some PyPI maintainer data, we'll use this later on."
"This tutorial will make use of several datasets (and databases!). Before we\n",
"start we need to download some data and setup some databases."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"import urllib.request\n",
"from pathlib import Path\n",
"from setup_data import setup\n",
"\n",
"## Download PyPI maintainer data from Ibis Tutorial bucket\n",
"\n",
"filenames = [\n",
" \"deps.parquet\",\n",
" \"maintainers.parquet\",\n",
" \"package_urls.parquet\",\n",
" \"packages.parquet\",\n",
" \"scorecard_checks.parquet\",\n",
" \"wheels.parquet\",\n",
"]\n",
"\n",
"folder = Path(\"pypi\")\n",
"folder.mkdir(exist_ok=True)\n",
"\n",
"for filename in filenames:\n",
" path = folder / filename\n",
" if not path.exists():\n",
" print(f\"Downloading {filename} to {path}\")\n",
" urllib.request.urlretrieve(\n",
" f\"https://storage.googleapis.com/ibis-tutorial-data/pypi/2024-04-24/{filename}\",\n",
" path,\n",
" )"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's continue by loading some IMDB ratings data into a local PostgreSQL database!\n",
"We will do this using DuckDB, yes you can do that!"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"!curl -OLsS 'https://storage.googleapis.com/ibis-tutorial-data/imdb/2024-03-22/imdb_title_ratings.parquet'\n",
"!curl -OLsS 'https://storage.googleapis.com/ibis-tutorial-data/imdb/2024-03-22/imdb_title_basics.parquet'\n",
"!psql < demo/create_imdb.sql\n",
"!duckdb < load_imdb.sql"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And we'll confirm that our PostgreSQL database contains the tables we just loaded."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"!psql < verify.sql"
"setup()"
],
"execution_count": null,
"outputs": []
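The new `setup_data.setup` helper itself isn't shown in this diff. A minimal sketch of what it might look like, assuming it simply consolidates the download loops deleted from the notebooks below (the bucket URL, snapshot dates, and filenames come from those deleted cells; everything else is a guess at the real `setup_data.py`):

```python
# setup_data.py -- a hypothetical sketch, not the repo's actual implementation.
# It consolidates the per-notebook download loops deleted in this commit.
import urllib.request
from pathlib import Path

BUCKET = "https://storage.googleapis.com/ibis-tutorial-data"

# Dataset name -> (snapshot date, filenames), as used in the deleted cells.
DATASETS = {
    "pypi": (
        "2024-04-24",
        [
            "deps.parquet",
            "maintainers.parquet",
            "package_urls.parquet",
            "packages.parquet",
            "scorecard_checks.parquet",
            "wheels.parquet",
        ],
    ),
    "imdb": (
        "2024-03-22",
        ["imdb_title_basics_sample_5.parquet", "imdb_title_ratings.parquet"],
    ),
}


def setup():
    """Download all tutorial datasets into data/, skipping files that exist."""
    for name, (date, filenames) in DATASETS.items():
        folder = Path("data") / name
        folder.mkdir(parents=True, exist_ok=True)
        for filename in filenames:
            path = folder / filename
            if not path.exists():
                print(f"Downloading {filename} to {path}")
                urllib.request.urlretrieve(
                    f"{BUCKET}/{name}/{date}/{filename}", path
                )
    # The real helper presumably also loads the IMDB tables into PostgreSQL
    # (the old notebook shelled out to psql and duckdb); omitted here.
```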
45 changes: 1 addition & 44 deletions 01 - Getting Started.ipynb
@@ -10,44 +10,6 @@
"\n",
"See the [README](https://github.com/ibis-project/ibis-tutorial#setup) for up-to-date installation instructions!\n",
"\n",
"## Download some data\n",
"\n",
"There are other ways to get example data, but we'll start by downloading the penguins dataset.\n",
"$^1$."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"from pathlib import Path\n",
"\n",
"import duckdb\n",
"from packaging.version import parse as vparse\n",
"\n",
"duck_version = vparse(\"0.10\") # backwards compatible with duckdb==1.0\n",
"\n",
"ddb_file = Path(\"palmer_penguins.ddb\")\n",
"\n",
"\n",
"if not ddb_file.exists():\n",
" import urllib.request\n",
"\n",
" urllib.request.urlretrieve(\n",
" f\"https://storage.googleapis.com/ibis-tutorial-data/penguins/0.{duck_version.minor}/palmer_penguins.ddb\",\n",
" ddb_file,\n",
" )"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"DuckDB is similar to sqlite -- we have a single file on disk (or an in-memory\n",
"connection) that we can operate on.\n",
"\n",
"## Intro\n",
"\n",
"We can begin by importing Ibis and firing up a connection to DuckDB! (DuckDB is\n",
@@ -61,7 +23,7 @@
"source": [
"import ibis\n",
"\n",
"con = ibis.duckdb.connect(\"palmer_penguins.ddb\", read_only=True)"
"con = ibis.duckdb.connect(\"data/penguins/palmer_penguins.ddb\", read_only=True)"
],
"execution_count": null,
"outputs": []
Expand All @@ -70,11 +32,6 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note**: when you connect to a DuckDB database file, DuckDB will create a\n",
" lock-file to prevent data corruption. If you see a `palmer_penguins.ddb.wal`\n",
" file, you can safely ignore it. It will get cleaned up automatically.\n",
"\n",
"\n",
"Now we have a connection, we can start by looking around. Are there any tables\n",
"in this database (one would hope)?\n"
]
11 changes: 6 additions & 5 deletions 02 - Ibis and the Python Ecosystem.ipynb
@@ -316,7 +316,7 @@
"cell_type": "code",
"metadata": {},
"source": [
"con = ibis.duckdb.connect(\"palmer_penguins.ddb\", read_only=True)\n",
"con = ibis.duckdb.connect(\"data/penguins/palmer_penguins.ddb\", read_only=True)\n",
"penguins = con.table(\"penguins\")\n",
"\n",
"penguins"
@@ -357,14 +357,14 @@
"### Exercise 2: Add a Column for Scientific Name\n",
"\n",
"Like all species, the penguins here have scientific names. These are available\n",
"in the `penguin_species.jsonl` file in the tutorial repo."
"in the `data/penguins/species.jsonl` file in the tutorial repo."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"!cat penguin_species.jsonl"
"!cat data/penguins/penguin_species.jsonl"
],
"execution_count": null,
"outputs": []
@@ -378,8 +378,9 @@
"`ibis` as a memtable.\n",
"\n",
"Your job is to:\n",
"- Read in the `penguin_species.jsonl` file. You might find the\n",
" `pandas.read_json` function useful (note you'll need to pass in `lines=True`)\n",
"\n",
"- Read in the `species.jsonl` file. You might find the `pandas.read_json`\n",
" function useful (note you'll need to pass in `lines=True`)\n",
"- Coerce it to a `memtable`.\n",
"- Join the original `penguins` table with the new `species` memtable to label\n",
" every row with its proper scientific name.\n",
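A possible solution sketch for the exercise above — not the tutorial's official answer. It assumes `species.jsonl` carries a `species` column matching the penguins table plus a scientific-name column:

```python
import ibis
import pandas as pd

# Connect to the penguins database used throughout the tutorial.
con = ibis.duckdb.connect("data/penguins/palmer_penguins.ddb", read_only=True)
penguins = con.table("penguins")

# Read the newline-delimited JSON file into a pandas DataFrame.
species_df = pd.read_json("data/penguins/species.jsonl", lines=True)

# Coerce it to an in-memory Ibis table.
species = ibis.memtable(species_df)

# Join on the shared `species` column to label every row with its
# scientific name.
labeled = penguins.join(species, "species")
labeled
```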
57 changes: 12 additions & 45 deletions 03 - Switching Backends.ipynb
@@ -16,54 +16,17 @@
"\n",
"## IMDB Dataset\n",
"\n",
"For this section, we'll use some of Ibis' built-in example datasets,\n",
"specifically, some IMDB data.\n",
"\n",
"**Note**: the full data for both of these tables is available in\n",
"`ibis.examples.imdb_title_ratings` and `ibis.examples.imdb_title_basics`, but\n",
"we're not using those in-person to avoid everyone downloading the same 250mb\n",
"file at once."
"For this section, we'll make use of the IMDB dataset, which provides a\n",
"[snapshot](https://datasets.imdbws.com/) of the films and ratings on IMDB. This\n",
"dataset was downloaded at the start of the tutorial, and is available as a set of\n",
"parquet files in the `data/imdb/` directory:\n"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"from pathlib import Path\n",
"\n",
"filenames = [\n",
" \"imdb_title_basics_sample_5.parquet\",\n",
" \"imdb_title_ratings.parquet\",\n",
"]\n",
"\n",
"folder = Path(\"imdb_smol\")\n",
"folder.mkdir(exist_ok=True)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {},
"source": [
"for filename in filenames:\n",
" path = folder / filename\n",
" if not path.exists():\n",
" import urllib.request\n",
"\n",
" urllib.request.urlretrieve(\n",
" f\"https://storage.googleapis.com/ibis-tutorial-data/imdb/2024-03-22/{filename}\",\n",
" path,\n",
" )"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {},
"source": [
"!ls imdb_smol/"
"!ls data/imdb/"
],
"execution_count": null,
"outputs": []
@@ -72,13 +35,17 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we'll load both the `imdb_title_ratings` and `imdb_title_basics_sample_5`\n",
"files to work with a sample of the data. Once we have a query we're happy with\n",
"we'll run the same query on the full dataset.\n",
"\n",
"### Parquet loading\n",
"\n",
"In the previous examples we used a pre-existing DuckDB database, and some\n",
"in-memory tables. Another common pattern is that you have a few parquet files\n",
"you want to work with. We can load those in to an in-memory DuckDB connection.\n",
"(Note that \"in-memory\" here just means ephemeral, DuckDB is still very happy to\n",
"operate on as much data as your hard drive can hold)"
"operate on as much data as your hard drive can hold)."
]
},
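The cell that creates `con` is collapsed in this diff. A minimal sketch of the in-memory connection the paragraph describes (standard Ibis/DuckDB usage, not taken from the diff itself):

```python
import ibis

# With no path argument, DuckDB gives us an ephemeral in-memory database.
con = ibis.duckdb.connect()
```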
{
@@ -107,7 +74,7 @@
"metadata": {},
"source": [
"basics = con.read_parquet(\n",
" \"imdb_smol/imdb_title_basics_sample_5.parquet\", table_name=\"imdb_title_basics\"\n",
" \"data/imdb/imdb_title_basics_sample_5.parquet\", table_name=\"imdb_title_basics\"\n",
")"
],
"execution_count": null,
@@ -118,7 +85,7 @@
"metadata": {},
"source": [
"ratings = con.read_parquet(\n",
" \"imdb_smol/imdb_title_ratings.parquet\", table_name=\"imdb_title_ratings\"\n",
" \"data/imdb/imdb_title_ratings.parquet\", table_name=\"imdb_title_ratings\"\n",
")"
],
"execution_count": null,
2 changes: 1 addition & 1 deletion 04 - Playing with PyPI.ipynb
@@ -53,7 +53,7 @@
" \"wheels.parquet\",\n",
"]\n",
"\n",
"folder = Path(\"pypi\")\n",
"folder = Path(\"data/pypi\")\n",
"\n",
"for filename in filenames:\n",
" path = folder / filename\n",
44 changes: 0 additions & 44 deletions PYCON_WELCOME.md

This file was deleted.

File renamed without changes.
6 changes: 0 additions & 6 deletions load_imdb.sql

This file was deleted.

