Move data loading to start of tutorial
Previously our data downloading/loading was intermingled throughout (and
sometimes duplicated in a few places). This change migrates all data
downloading and loading into the first notebook so that later
notebooks can focus on the content.
jcrist committed Jul 2, 2024
1 parent 50435bc commit 2d2586e
Showing 18 changed files with 143 additions and 383 deletions.
3 changes: 3 additions & 0 deletions .devcontainer/compose.yaml
@@ -19,13 +19,16 @@ services:
POSTGRES_PASSWORD: postgres
POSTGRES_DB: postgres
POSTGRES_USER: postgres
user: postgres
image: postgres:15
healthcheck:
interval: 1s
retries: 20
test:
- CMD
- pg_isready
ports:
- "5432:5432"
volumes:
- postgres:/var/lib/postgresql/data

65 changes: 4 additions & 61 deletions 00 - Start Here.ipynb
@@ -16,74 +16,17 @@
"1. [Playing with PyPI](./04%20-%20Playing%20with%20PyPI.ipynb)\n",
"\n",
"\n",
"First, let's kick off a download of some PyPI maintainer data, we'll use this later on."
"This tutorial will make use of several datasets (and databases!). Before we\n",
"start we need to download some data and setup some databases."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"import urllib.request\n",
"from pathlib import Path\n",
"from setup_data import setup\n",
"\n",
"## Download PyPI maintainer data from Ibis Tutorial bucket\n",
"\n",
"filenames = [\n",
" \"deps.parquet\",\n",
" \"maintainers.parquet\",\n",
" \"package_urls.parquet\",\n",
" \"packages.parquet\",\n",
" \"scorecard_checks.parquet\",\n",
" \"wheels.parquet\",\n",
"]\n",
"\n",
"folder = Path(\"pypi\")\n",
"folder.mkdir(exist_ok=True)\n",
"\n",
"for filename in filenames:\n",
" path = folder / filename\n",
" if not path.exists():\n",
" print(f\"Downloading {filename} to {path}\")\n",
" urllib.request.urlretrieve(\n",
" f\"https://storage.googleapis.com/ibis-tutorial-data/pypi/2024-04-24/{filename}\",\n",
" path,\n",
" )"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's continue by loading some IMDB ratings data into a local PostgreSQL database!\n",
"We will do this using DuckDB, yes you can do that!"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"!curl -OLsS 'https://storage.googleapis.com/ibis-tutorial-data/imdb/2024-03-22/imdb_title_ratings.parquet'\n",
"!curl -OLsS 'https://storage.googleapis.com/ibis-tutorial-data/imdb/2024-03-22/imdb_title_basics.parquet'\n",
"!psql < demo/create_imdb.sql\n",
"!duckdb < load_imdb.sql"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And we'll confirm that our PostgreSQL database contains the tables we just loaded."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"!psql < verify.sql"
"setup()"
],
"execution_count": null,
"outputs": []
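The new `setup_data.setup` helper itself isn't shown in this diff. A minimal sketch of what it might look like, assuming it simply consolidates the download loops deleted from the notebooks below (the bucket URL, snapshot dates, and filenames come from those deleted cells; everything else is a guess at the real `setup_data.py`):

```python
# setup_data.py -- a hypothetical sketch, not the repo's actual implementation.
# It consolidates the per-notebook download loops deleted in this commit.
import urllib.request
from pathlib import Path

BUCKET = "https://storage.googleapis.com/ibis-tutorial-data"

# Dataset name -> (snapshot date, filenames), as used in the deleted cells.
DATASETS = {
    "pypi": (
        "2024-04-24",
        [
            "deps.parquet",
            "maintainers.parquet",
            "package_urls.parquet",
            "packages.parquet",
            "scorecard_checks.parquet",
            "wheels.parquet",
        ],
    ),
    "imdb": (
        "2024-03-22",
        ["imdb_title_basics_sample_5.parquet", "imdb_title_ratings.parquet"],
    ),
}


def setup():
    """Download all tutorial datasets into data/, skipping files that exist."""
    for name, (date, filenames) in DATASETS.items():
        folder = Path("data") / name
        folder.mkdir(parents=True, exist_ok=True)
        for filename in filenames:
            path = folder / filename
            if not path.exists():
                print(f"Downloading {filename} to {path}")
                urllib.request.urlretrieve(
                    f"{BUCKET}/{name}/{date}/{filename}", path
                )
    # The real helper presumably also loads the IMDB tables into PostgreSQL
    # (the old notebook shelled out to psql and duckdb); omitted here.
```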
45 changes: 1 addition & 44 deletions 01 - Getting Started.ipynb
@@ -10,44 +10,6 @@
"\n",
"See the [README](https://github.com/ibis-project/ibis-tutorial#setup) for up-to-date installation instructions!\n",
"\n",
"## Download some data\n",
"\n",
"There are other ways to get example data, but we'll start by downloading the penguins dataset.\n",
"$^1$."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"from pathlib import Path\n",
"\n",
"import duckdb\n",
"from packaging.version import parse as vparse\n",
"\n",
"duck_version = vparse(\"0.10\") # backwards compatible with duckdb==1.0\n",
"\n",
"ddb_file = Path(\"palmer_penguins.ddb\")\n",
"\n",
"\n",
"if not ddb_file.exists():\n",
" import urllib.request\n",
"\n",
" urllib.request.urlretrieve(\n",
" f\"https://storage.googleapis.com/ibis-tutorial-data/penguins/0.{duck_version.minor}/palmer_penguins.ddb\",\n",
" ddb_file,\n",
" )"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"DuckDB is similar to sqlite -- we have a single file on disk (or an in-memory\n",
"connection) that we can operate on.\n",
"\n",
"## Intro\n",
"\n",
"We can begin by importing Ibis and firing up a connection to DuckDB! (DuckDB is\n",
@@ -61,7 +23,7 @@
"source": [
"import ibis\n",
"\n",
"con = ibis.duckdb.connect(\"palmer_penguins.ddb\", read_only=True)"
"con = ibis.duckdb.connect(\"data/penguins/palmer_penguins.ddb\", read_only=True)"
],
"execution_count": null,
"outputs": []
Expand All @@ -70,11 +32,6 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note**: when you connect to a DuckDB database file, DuckDB will create a\n",
" lock-file to prevent data corruption. If you see a `palmer_penguins.ddb.wal`\n",
" file, you can safely ignore it. It will get cleaned up automatically.\n",
"\n",
"\n",
"Now we have a connection, we can start by looking around. Are there any tables\n",
"in this database (one would hope)?\n"
]
11 changes: 6 additions & 5 deletions 02 - Ibis and the Python Ecosystem.ipynb
@@ -316,7 +316,7 @@
"cell_type": "code",
"metadata": {},
"source": [
"con = ibis.duckdb.connect(\"palmer_penguins.ddb\", read_only=True)\n",
"con = ibis.duckdb.connect(\"data/penguins/palmer_penguins.ddb\", read_only=True)\n",
"penguins = con.table(\"penguins\")\n",
"\n",
"penguins"
@@ -357,14 +357,14 @@
"### Exercise 2: Add a Column for Scientific Name\n",
"\n",
"Like all species, the penguins here have scientific names. These are available\n",
"in the `penguin_species.jsonl` file in the tutorial repo."
"in the `data/penguins/species.jsonl` file in the tutorial repo."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"!cat penguin_species.jsonl"
"!cat data/penguins/penguin_species.jsonl"
],
"execution_count": null,
"outputs": []
@@ -378,8 +378,9 @@
"`ibis` as a memtable.\n",
"\n",
"Your job is to:\n",
"- Read in the `penguin_species.jsonl` file. You might find the\n",
" `pandas.read_json` function useful (note you'll need to pass in `lines=True`)\n",
"\n",
"- Read in the `species.jsonl` file. You might find the `pandas.read_json`\n",
" function useful (note you'll need to pass in `lines=True`)\n",
"- Coerce it to a `memtable`.\n",
"- Join the original `penguins` table with the new `species` memtable to label\n",
" every row with its proper scientific name.\n",
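A possible solution sketch for the exercise above — not the tutorial's official answer. It assumes `species.jsonl` carries a `species` column matching the penguins table plus a scientific-name column:

```python
import ibis
import pandas as pd

# Connect to the penguins database used throughout the tutorial.
con = ibis.duckdb.connect("data/penguins/palmer_penguins.ddb", read_only=True)
penguins = con.table("penguins")

# Read the newline-delimited JSON file into a pandas DataFrame.
species_df = pd.read_json("data/penguins/species.jsonl", lines=True)

# Coerce it to an in-memory Ibis table.
species = ibis.memtable(species_df)

# Join on the shared `species` column to label every row with its
# scientific name.
labeled = penguins.join(species, "species")
labeled
```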
57 changes: 12 additions & 45 deletions 03 - Switching Backends.ipynb
@@ -16,54 +16,17 @@
"\n",
"## IMDB Dataset\n",
"\n",
"For this section, we'll use some of Ibis' built-in example datasets,\n",
"specifically, some IMDB data.\n",
"\n",
"**Note**: the full data for both of these tables is available in\n",
"`ibis.examples.imdb_title_ratings` and `ibis.examples.imdb_title_basics`, but\n",
"we're not using those in-person to avoid everyone downloading the same 250mb\n",
"file at once."
"For this section, we'll make use of the IMDB dataset, which provides a\n",
"[snapshot](https://datasets.imdbws.com/) of the films and ratings on IMDB. This\n",
"dataset was downloaded at the start of the tutorial, and is available as a set of\n",
"parquet files in the `data/imdb/` directory:\n"
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"from pathlib import Path\n",
"\n",
"filenames = [\n",
" \"imdb_title_basics_sample_5.parquet\",\n",
" \"imdb_title_ratings.parquet\",\n",
"]\n",
"\n",
"folder = Path(\"imdb_smol\")\n",
"folder.mkdir(exist_ok=True)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {},
"source": [
"for filename in filenames:\n",
" path = folder / filename\n",
" if not path.exists():\n",
" import urllib.request\n",
"\n",
" urllib.request.urlretrieve(\n",
" f\"https://storage.googleapis.com/ibis-tutorial-data/imdb/2024-03-22/{filename}\",\n",
" path,\n",
" )"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {},
"source": [
"!ls imdb_smol/"
"!ls data/imdb/"
],
"execution_count": null,
"outputs": []
@@ -72,13 +35,17 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we'll load both the `imdb_title_ratings` and `imdb_title_basics_sample_5`\n",
"files to work with a sample of the data. Once we have a query we're happy with\n",
"we'll run the same query on the full dataset.\n",
"\n",
"### Parquet loading\n",
"\n",
"In the previous examples we used a pre-existing DuckDB database, and some\n",
"in-memory tables. Another common pattern is that you have a few parquet files\n",
"you want to work with. We can load those in to an in-memory DuckDB connection.\n",
"(Note that \"in-memory\" here just means ephemeral, DuckDB is still very happy to\n",
"operate on as much data as your hard drive can hold)"
"operate on as much data as your hard drive can hold)."
]
},
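The cell that creates `con` is collapsed in this diff. A minimal sketch of the in-memory connection the paragraph describes (standard Ibis/DuckDB usage, not taken from the diff itself):

```python
import ibis

# With no path argument, DuckDB gives us an ephemeral in-memory database.
con = ibis.duckdb.connect()
```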
{
@@ -107,7 +74,7 @@
"metadata": {},
"source": [
"basics = con.read_parquet(\n",
" \"imdb_smol/imdb_title_basics_sample_5.parquet\", table_name=\"imdb_title_basics\"\n",
" \"data/imdb/imdb_title_basics_sample_5.parquet\", table_name=\"imdb_title_basics\"\n",
")"
],
"execution_count": null,
@@ -118,7 +85,7 @@
"metadata": {},
"source": [
"ratings = con.read_parquet(\n",
" \"imdb_smol/imdb_title_ratings.parquet\", table_name=\"imdb_title_ratings\"\n",
" \"data/imdb/imdb_title_ratings.parquet\", table_name=\"imdb_title_ratings\"\n",
")"
],
"execution_count": null,
2 changes: 1 addition & 1 deletion 04 - Playing with PyPI.ipynb
@@ -53,7 +53,7 @@
" \"wheels.parquet\",\n",
"]\n",
"\n",
"folder = Path(\"pypi\")\n",
"folder = Path(\"data/pypi\")\n",
"\n",
"for filename in filenames:\n",
" path = folder / filename\n",
44 changes: 0 additions & 44 deletions PYCON_WELCOME.md

This file was deleted.

File renamed without changes.
6 changes: 0 additions & 6 deletions load_imdb.sql

This file was deleted.

