A training project for building Prefect-managed Databricks pipelines.
CI (GitHub Actions) runs checks and tests, then deploys the notebooks to Databricks and the Prefect flows to Prefect Cloud.
> **Note**
> This project is still a work in progress (WIP).
The following will be needed:

- A Prefect Cloud account
- An Azure account (for Azure Databricks and Azure Blob Storage)
- Set up Azure Databricks and create a token for your account.
- Create a container `flows` in Azure Storage (a Python sketch for this step follows the list).
- Prepare a `.env` file from `.env_template`:

  ```shell
  cp .env_template .env
  ```

  and fill in your secrets.
- Launch the local Prefect Agent and Prefect CLI with `docker compose`:

  ```shell
  docker compose up --remove-orphans --force-recreate --pull always

  # To clean up afterwards:
  # docker compose down --rmi all --volumes
  # docker system prune --all --volumes --force  # warning! removes all images and volumes!
  ```

- Register your storage as a Block in Prefect Cloud (see the registration sketch after this list):

  ```shell
  docker exec -it databricks_pipelines-cli-1 python3 src/flows/maintenance/make_block_remote_storage.py
  ```

- Deploy the two existing flows (this is also done by CI; a deployment sketch follows the list):

  ```shell
  docker exec -it databricks_pipelines-cli-1 bash -c "python3 one.py && python3 two.py"
  # or manually:
  docker exec -it databricks_pipelines-cli-1 bash
  ```
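The `flows` container can be created in the Azure portal, or programmatically. A minimal sketch, assuming the `azure-storage-blob` package and a connection string exposed through an environment variable (the variable name is an assumption; use whatever your `.env` defines):

```python
# Hypothetical sketch: create the `flows` container with azure-storage-blob.
# The env-var name below is an assumption, not necessarily what .env_template uses.
import os

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)
service.create_container("flows")  # raises ResourceExistsError if it already exists
```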
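For reference, a block-registration script like `make_block_remote_storage.py` might look roughly like the following sketch, using Prefect 2's built-in `Azure` filesystem block. The block name and environment-variable name are assumptions, not the repo's actual values:

```python
# Hypothetical sketch of a block-registration script (Prefect 2.x API);
# the block name and env var below are assumptions.
import os

from prefect.filesystems import Azure  # Prefect's built-in Azure Blob Storage block

azure_block = Azure(
    bucket_path="flows",  # the container created during setup
    azure_storage_connection_string=os.environ["AZURE_STORAGE_CONNECTION_STRING"],
)
azure_block.save("databricks-flows-storage", overwrite=True)  # registers the block in Prefect Cloud
```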
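Similarly, each deployment script (e.g. `one.py`) presumably builds a Prefect deployment backed by that storage block. A minimal sketch, assuming Prefect 2's `Deployment` API; the flow module, deployment name, block name, and work queue are all placeholders:

```python
# Hypothetical deployment script sketch (Prefect 2.x); the flow import,
# deployment name, block name, and work queue are assumptions.
from prefect.deployments import Deployment
from prefect.filesystems import Azure

from flows.example import example_flow  # placeholder import for one of the project's flows

storage = Azure.load("databricks-flows-storage")  # the block registered above

deployment = Deployment.build_from_flow(
    flow=example_flow,
    name="example-deployment",
    storage=storage,
    work_queue_name="default",
)
deployment.apply()  # pushes the deployment to Prefect Cloud
```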
Now you should be able to trigger Databricks jobs from the Prefect Cloud UI.
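As an illustration of what such a flow can do, here is a minimal sketch of a Prefect flow that triggers an existing Databricks job through the Databricks Jobs 2.1 REST API. The env-var names and job ID are assumptions, and the actual flows in this repo may use a dedicated client instead of raw HTTP:

```python
# Minimal sketch of a Prefect flow triggering a Databricks job via the
# Jobs 2.1 run-now endpoint; env-var names and the job ID are assumptions.
import os

import requests
from prefect import flow, task


@task
def run_databricks_job(job_id: int) -> dict:
    """Trigger an existing Databricks job and return the API response."""
    host = os.environ["DATABRICKS_HOST"]    # e.g. https://adb-<id>.azuredatabricks.net
    token = os.environ["DATABRICKS_TOKEN"]  # the token created during setup
    response = requests.post(
        f"{host}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {token}"},
        json={"job_id": job_id},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()  # contains the run_id of the started run


@flow
def trigger_databricks_job(job_id: int) -> None:
    run = run_databricks_job(job_id)
    print(f"Started Databricks run {run['run_id']}")


if __name__ == "__main__":
    trigger_databricks_job(job_id=123)  # placeholder job ID
```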
The GitHub Actions CI/CD flow is defined under `.github/workflows`:
```mermaid
---
title: CI flow
---
flowchart LR
    subgraph pr[Pull request flow]
        direction TB
        A1[Install Python and dependencies] --> B1[Static checks] --> C1[Unit tests] --> D1[Upload test results]
    end
    subgraph deploy[Merge to master flow]
        direction TB
        A2[Upload notebooks to Databricks] --> B2[Build and upload Python lib to Databricks] --> C2[Deploy Prefect Flows to Prefect Cloud]
    end
    pr --> deploy
```