Merge pull request #13 from FAST-HEP/kreczko-first-light
Very basic functionality
kreczko authored Feb 22, 2024
2 parents c177124 + d6c6f6a commit d5dc127
Showing 21 changed files with 440 additions and 54 deletions.
6 changes: 1 addition & 5 deletions .github/workflows/ci.yml
@@ -40,13 +40,9 @@ jobs:
strategy:
fail-fast: false
matrix:
python-version: ["3.8", "3.11", "3.12"]
python-version: ["3.10", "3.11"]
runs-on: [ubuntu-latest, macos-latest, windows-latest]

-      include:
-        - python-version: pypy-3.9
-          runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v4
with:
3 changes: 3 additions & 0 deletions .gitignore
@@ -159,3 +159,6 @@ Thumbs.db

# copier
.copier*

+# VSCode
+.vscode/
2 changes: 2 additions & 0 deletions .pre-commit-config.yaml
@@ -57,7 +57,9 @@ repos:
files: src|tests
args: []
additional_dependencies:
+  - plumbum
- pytest
+  - pydantic

- repo: https://github.com/codespell-project/codespell
rev: "v2.2.5"
90 changes: 90 additions & 0 deletions docs/examples/hello_world.md
@@ -0,0 +1,90 @@
# Hello world

Sometimes you just want to see some code. This section contains some real-life
examples of how to use `fasthep-flow`.

```yaml
stages:
  - name: "hello_world in bash"
    type: "fasthep_flow.operators.BashOperator"
    kwargs:
      bash_command: echo "Hello World!"
```
Save this to a file called `hello_world.yaml` and run:

```bash
fasthep-flow execute hello_world.yaml
```

This will print "Hello World!" to the console.

So far so good, but what does it actually do? Let's go through the execution
step by step.

## Creating a flow

The first thing that `fasthep-flow` does is create a flow. This is done by
creating a `prefect.Flow` object, and adding a task for each step in the YAML
file. The task is created by the `fasthep-flow` operator, and the parameters are
passed to the task as keyword arguments.

We can do this ourselves by creating a flow and adding a task to it.

```python
from fasthep_flow.operators import BashOperator
from prefect import Flow
flow = Flow("hello_world")
task = BashOperator(bash_command="echo 'Hello World!'")
flow.add_task(task)
```

## Running the flow

Next we have to decide how to execute this flow. By default, `fasthep-flow` will
run the flow on the local machine. This is done by calling `flow.run()`.

```python
flow.run()
```

## Running the flow on a cluster

The real strength of `fasthep-flow` is that it can run the flow on a cluster
with the same config file. Internally, this is done by creating a Dask workflow
first, and then running it on the specified cluster (e.g. HTCondor or Google
Cloud Composer). For now, let's just run it on a local Dask cluster.

```bash
fasthep-flow execute hello_world.yaml --executor DaskLocal
```

This will start a Dask cluster on your local machine and run the flow on it.
The output will be the same, but you will also find additional output files
with Dask performance information.
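
To make this more concrete, here is a minimal sketch of a Prefect flow running
on a local Dask cluster. It assumes the `prefect-dask` integration package;
`fasthep-flow`'s actual internals may differ.

```python
# Minimal sketch: a Prefect flow on a local Dask cluster.
# Assumes the prefect-dask integration package; fasthep-flow's
# actual internals may differ.
from prefect import flow, task
from prefect_dask import DaskTaskRunner


@task
def say_hello():
    print("Hello World!")


# DaskTaskRunner() without arguments starts a temporary local Dask cluster
@flow(task_runner=DaskTaskRunner())
def hello_world():
    say_hello.submit()  # submit() schedules the task on the Dask cluster


if __name__ == "__main__":
    hello_world()
```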

## Provenance

In a real-world scenario, you would want to keep track of the provenance of your
flow. This is done automatically by `fasthep-flow`, and you can find the
provenance in the `output/provenance` folder.

For more information, see [Provenance](../provenance.md).

So what does this look like for our hello world example?

```bash
tree output
```
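
The exact contents depend on your `fasthep-flow` version, but given the
`output/provenance` folder described above, the layout might look roughly like
this (hypothetical sketch):

```
output/
└── provenance/
    └── ...  # provenance records for the flow and its stages
```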

## Next steps

This was a very simple example, but it shows the basic concepts of
`fasthep-flow`. For more realistic, experiment-specific examples, see
[Examples](./index.md); for more advanced ones, see
[Advanced Examples](../advanced_examples/index.md).

11 changes: 11 additions & 0 deletions docs/examples/hello_world.yaml
@@ -0,0 +1,11 @@
stages:
  - name: "hello_world in bash"
    type: "fasthep_flow.operators.BashOperator"
    kwargs:
      bash_command: echo
      arguments: ["Hello World!"]
  - name: "touch /tmp/date.txt"
    type: "fasthep_flow.operators.BashOperator"
    kwargs:
      bash_command: touch
      arguments: ["/tmp/date.txt"]
1 change: 1 addition & 0 deletions docs/examples/index.md
@@ -4,6 +4,7 @@ Documentation is useful, but sometimes you just want to see some code. This
section contains some real-life examples of how to use `fasthep-flow`.

```{toctree}
+hello_world.md
atlas_example.md
cms_pub_example.md
dune_example.md
13 changes: 8 additions & 5 deletions docs/index.md
@@ -2,20 +2,22 @@

## Introduction

-`fasthep-flow` is a package for converting YAML files into an Apache Airflow
-DAG. It is designed to be used with the [fast-hep](https://fast-hep.github.io/)
-package ecosystem, but can be used independently.
+`fasthep-flow` is a package for converting YAML files into a
+[prefect](https://docs.prefect.io/latest/concepts/flows/) flow. It is designed
+to be used with the [fast-hep](https://fast-hep.github.io/) package ecosystem,
+but can be used independently.

The goal of this package is to define a workflow, e.g. a HEP analysis, in a YAML
-file, and then convert that YAML file into an Apache Airflow DAG. This DAG can
+file, and then convert that YAML file into a
+[prefect flow](https://docs.prefect.io/latest/concepts/flows/). This flow can
then be run on a local machine, or on a cluster using
[CERN's HTCondor](https://batchdocs.web.cern.ch/local/submit.html) (via Dask) or
[Google Cloud Composer](https://cloud.google.com/composer).

`fasthep-flow`'s YAML files draw inspiration from Continuous Integration (CI)
pipelines and Ansible Playbooks to define the workflow, where each step is a
task that can be run in parallel. `fasthep-flow` will check the parameters of
-each task, and then generate the DAG. The DAG will have a task for each step,
+each task, and then generate the flow. The flow will have a task for each step,
and the dependencies between the tasks will be defined by the `needs` key in the
YAML file. More on this under [Configuration](./configuration/index.md).
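
For illustration, a dependency between two stages might be expressed like this
(the exact schema is described under [Configuration](./configuration/index.md);
the stage names and commands here are hypothetical):

```yaml
stages:
  - name: "produce input"
    type: "fasthep_flow.operators.BashOperator"
    kwargs:
      bash_command: echo
      arguments: ["producing input"]
  - name: "process input"
    type: "fasthep_flow.operators.BashOperator"
    kwargs:
      bash_command: echo
      arguments: ["processing input"]
    needs: ["produce input"] # run only after "produce input" has finished
```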

@@ -39,6 +41,7 @@ concepts.md
configuration/index.md
operators.md
executors.md
+provenance.md
examples/index.md
advanced_examples/index.md
command_line_interface.md
49 changes: 49 additions & 0 deletions docs/provenance.md
@@ -0,0 +1,49 @@
# Provenance

## What is Provenance?

Provenance refers to the detailed history of the origin, lineage, and changes
made to data throughout its lifecycle. It encompasses the documentation of
processes, inputs, outputs, and transformations that data undergoes, providing a
comprehensive audit trail that can be used to verify the data's integrity and
authenticity.

## Why Does Provenance Matter for Scientific Data Analysis?

In scientific data analysis, provenance is crucial as it ensures the
reproducibility and reliability of results. It enables researchers to trace back
through the analysis workflow to understand how data was altered, what
computational steps were performed, and by whom. This traceability is essential
for validating research findings, facilitating peer reviews, and enabling other
researchers to replicate and build upon the work.

## Our Approach to Provenance in the YAML Config

To integrate provenance into our workflows, we introduce a dedicated provenance
section within the YAML configuration. This section describes which metadata
should be captured, e.g. the version of the dataset used, the origin of the
data, the specific parameters set for each analysis stage, and the individual
responsible for each step (taken from the git history). By embedding this
information directly into the workflow configuration, we ensure that every step
of data processing is transparent and traceable. This not only adheres to best
practices in scientific data handling but also empowers users to conduct robust
and transparent analyses.

### Example

```yaml
provenance:
  datasets:
    source: fasthep-curator # Specifies the tool used for dataset curation
  analysis:
    include:
      - steps # Enumerates the individual steps taken in the analysis
      - parameters # Parameters used at each step for reproducibility
      - git # Git commit hash, branch, and status for version control
      - performance # Metrics to measure the efficiency of the analysis
      - environment # Software environment, including library versions
      - hardware # Hardware specifications where the analysis was run
  airflow:
    include:
      - db # Database configurations and states within Airflow
```
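
Since `omegaconf` is among `fasthep-flow`'s dependencies, a section like the
one above can be inspected programmatically. A minimal sketch, assuming a
hypothetical `workflow.yaml` containing the provenance section shown above (the
loading logic inside `fasthep-flow` itself may differ):

```python
# Minimal sketch: inspecting the provenance section with OmegaConf.
# "workflow.yaml" is a hypothetical file containing the section above.
from omegaconf import OmegaConf

config = OmegaConf.load("workflow.yaml")
print(config.provenance.datasets.source)  # -> fasthep-curator
print(list(config.provenance.analysis.include))  # -> ['steps', 'parameters', ...]
```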
17 changes: 16 additions & 1 deletion mypy.ini
@@ -1,6 +1,7 @@
[mypy]
+plugins = pydantic.mypy
files = ["src", "tests"]
-python_version = 3.8
+python_version = 3.10
warn_unused_configs = true
strict = true
show_error_codes = true
@@ -13,6 +14,8 @@ disallow_untyped_decorators = false
[mypy-fasthep_flow.*]
disallow_untyped_defs = true
disallow_incomplete_defs = true
+implicit_reexport = true
+ignore_missing_imports = true

[mypy-typer.*]
implicit_reexport = true
@@ -21,3 +24,15 @@ ignore_missing_imports = true
[mypy-omegaconf.*]
ignore_missing_imports = true
implicit_reexport = true

+[mypy-plumbum.*]
+ignore_missing_imports = true
+implicit_reexport = true
+
+[mypy-hydra.*]
+ignore_missing_imports = true
+implicit_reexport = true
+
+[mypy-prefect.*]
+ignore_missing_imports = true
+implicit_reexport = true
33 changes: 8 additions & 25 deletions pyproject.toml
@@ -10,18 +10,16 @@ authors = [
]
description = "Convert YAML into a workflow DAG"
readme = "README.md"
requires-python = ">=3.8"
requires-python = ">=3.10"
classifiers = [
"Development Status :: 1 - Planning",
"Development Status :: 2 - Pre-Alpha",
"Intended Audience :: Science/Research",
"Intended Audience :: Developers",
"License :: OSI Approved :: Apache Software License",
"Operating System :: OS Independent",
"Programming Language :: Python",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3 :: Only",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
@@ -30,7 +28,10 @@ classifiers = [
]
dynamic = ["version"]
dependencies = [
"apache-airflow >=2.7",
"eval_type_backport",
"hydra-core",
"plumbum",
"prefect",
"pyyaml >=5.4",
"omegaconf >=2.1",
"typer >=0.4",
@@ -45,6 +46,7 @@ test = [
dev = [
"pytest >=6",
"pytest-cov >=3",
"ruff",
]
docs = [
"sphinx>=7.0",
@@ -93,25 +95,6 @@ report.exclude_lines = [
'if typing.TYPE_CHECKING:',
]

-# [tool.mypy]
-# files = ["src", "tests"]
-# python_version = "3.8"
-# warn_unused_configs = true
-# strict = true
-# show_error_codes = true
-# enable_error_code = ["ignore-without-code", "redundant-expr", "truthy-bool"]
-# warn_unreachable = true
-# disallow_untyped_defs = false
-# disallow_incomplete_defs = false
-
-# [[tool.mypy.overrides]]
-# module = "fasthep_flow.*"
-# disallow_untyped_defs = true
-# disallow_incomplete_defs = true
-
-# [[tool.mypy-typer.*]]
-# implicit_reexport = true


[tool.ruff]
select = [
@@ -160,7 +143,7 @@ isort.required-imports = ["from __future__ import annotations"]


[tool.pylint]
py-version = "3.8"
py-version = "3.10"
ignore-paths = [".*/_version.py"]
reports.output-format = "colorized"
similarities.ignore-imports = "yes"
4 changes: 3 additions & 1 deletion src/fasthep_flow/__init__.py
@@ -8,5 +8,7 @@
from __future__ import annotations

from ._version import version as __version__
+from .config import FlowConfig
+from .workflow import Workflow

-__all__ = ("__version__",)
+__all__ = ("__version__", "FlowConfig", "Workflow")