AWS Glue offers a powerful set of tools. However, getting started requires either an AWS account or a Docker image plus some setup.
This repo offers an example docker-compose.yml file, accompanied by a project setup, that you can use to jump-start your Glue experimentation.
- Run Glue locally either via jobs or via Jupyter Lab
- Local S3 using Localstack
Docker and Docker Compose (or similar) are all you need.
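For orientation, a compose file for this kind of setup typically pairs a Glue container with a Localstack container, along the lines of the sketch below. The image tags, service names, and mount paths here are assumptions for illustration; check the repo's actual docker-compose.yml for the authoritative version.

```yaml
# Sketch only -- service names, image tags, and volumes are illustrative.
services:
  glue:
    image: amazon/aws-glue-libs:glue_libs_4.0.0_image_01  # example Glue local image
    ports:
      - "8888:8888"          # JupyterLab
    volumes:
      - ./jobs:/opt/jobs     # jobs available inside the container
      - ./notebooks:/home/glue_user/workspace/jupyter_workspace
  localstack:
    image: localstack/localstack
    ports:
      - "4566:4566"          # Localstack edge port (S3 endpoint)
    environment:
      - SERVICES=s3
```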
S3 setup can optionally be done on container startup: just edit .aws/buckets.sh, a bash script that can contain any set of AWS CLI S3 operations.
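As a rough sketch, such a startup script might look like the following. The bucket names and seed file are placeholders, and the endpoint URL assumes Localstack's default port; adapt them to your setup.

```shell
#!/bin/bash
# Example .aws/buckets.sh -- any AWS CLI S3 operations work here.
# Bucket names and file paths are placeholders.
aws --endpoint-url=http://localhost:4566 s3 mb s3://my-data
aws --endpoint-url=http://localhost:4566 s3 cp ./seed/orders.csv s3://my-data/orders.csv
aws --endpoint-url=http://localhost:4566 s3 ls
```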
Simply run:

```shell
docker compose up
```
JupyterLab is available at http://127.0.0.1:8888/. Any notebooks under notebooks/ will be available. A couple of sample notebooks exist to get you started.
AWS CLI can be used for managing buckets and objects. The only requirement is that mock credentials have been defined. Here's an example:

```shell
AWS_ACCESS_KEY_ID=mock AWS_SECRET_ACCESS_KEY=mock aws --endpoint-url=http://localhost:4566 s3 ls
```
All jobs under jobs/ are automatically copied to /opt/jobs inside the Glue docker container.
Connect to the Glue docker container. The command should be similar to:

```shell
docker exec -it local-aws-glue-glue-1 /bin/bash
```
Then, from the container's bash shell, use glue-spark-submit to run a job. For example, you can run orders.py with:

```shell
glue-spark-submit --master 'local[*]' /opt/jobs/orders.py
```
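If you are writing your own job to submit this way, a minimal skeleton might look like the sketch below. This is an illustrative, hypothetical job (not the repo's orders.py), and it must be run inside the Glue container with glue-spark-submit, since the awsglue libraries are only available there; reading from the Localstack S3 endpoint may additionally require Spark S3 endpoint configuration not shown here.

```python
# Hypothetical minimal Glue job skeleton -- run inside the Glue container:
#   glue-spark-submit --master 'local[*]' /opt/jobs/my_job.py
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Set up the Glue/Spark contexts.
sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Build a small in-memory DataFrame so the job runs without any S3 access.
df = spark.createDataFrame(
    [(1, "widget", 3), (2, "gadget", 1)],
    ["order_id", "item", "quantity"],
)
df.show()
```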