Merge pull request #13 from FAST-HEP/kreczko-first-light
Very basic functionality
kreczko authored Feb 22, 2024
2 parents c177124 + d6c6f6a commit d5dc127
Showing 21 changed files with 440 additions and 54 deletions.
6 changes: 1 addition & 5 deletions .github/workflows/ci.yml
@@ -40,13 +40,9 @@ jobs:
strategy:
fail-fast: false
matrix:
python-version: ["3.8", "3.11", "3.12"]
python-version: ["3.10", "3.11"]
runs-on: [ubuntu-latest, macos-latest, windows-latest]

-      include:
-        - python-version: pypy-3.9
-          runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v4
with:
3 changes: 3 additions & 0 deletions .gitignore
@@ -159,3 +159,6 @@ Thumbs.db

# copier
.copier*

+# VSCode
+.vscode/
2 changes: 2 additions & 0 deletions .pre-commit-config.yaml
@@ -57,7 +57,9 @@ repos:
files: src|tests
args: []
additional_dependencies:
+  - plumbum
- pytest
+  - pydantic

- repo: https://github.com/codespell-project/codespell
rev: "v2.2.5"
90 changes: 90 additions & 0 deletions docs/examples/hello_world.md
@@ -0,0 +1,90 @@
# Hello world

Sometimes you just want to see some code. This section contains some real-life
examples of how to use `fasthep-flow`.

```yaml
stages:
  - name: "hello_world in bash"
    type: "fasthep_flow.operators.BashOperator"
    kwargs:
      bash_command: echo "Hello World!"
```
Save this to a file called `hello_world.yaml` and run:

```bash
fasthep-flow execute hello_world.yaml
```

This will print "Hello World!" to the console.

So far so good, but what does it actually do? Let's go through the execution
step by step.

## Creating a flow

The first thing that `fasthep-flow` does is create a flow. This is done by
creating a `prefect.Flow` object, and adding a task for each step in the YAML
file. The task is created by the `fasthep-flow` operator, and the parameters are
passed to the task as keyword arguments.

We can do this ourselves by creating a flow and adding a task to it.

```python
from fasthep_flow.operators import BashOperator
from prefect import Flow
flow = Flow("hello_world")
task = BashOperator(bash_command="echo 'Hello World!'")
flow.add_task(task)
```

## Running the flow

Next we have to decide how to execute this flow. By default, `fasthep-flow` will
run the flow on the local machine. This is done by calling `flow.run()`.

```python
flow.run()
```

## Running the flow on a cluster

The real strength of `fasthep-flow` is that it can run the flow on a cluster
with the same config file. Internally, this is done by creating a Dask workflow
first, and then running it on the specified cluster (e.g. HTCondor or Google
Cloud Composer). For now, let's just run it on a local Dask cluster.

```bash
fasthep-flow execute hello_world.yaml --executor DaskLocal
```

This will start a Dask cluster on your local machine and run the flow on it.
The output will be the same, but you will also find additional output files
with Dask performance information.
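
To make this more concrete, here is a minimal sketch of a Prefect flow running
on a local Dask cluster. It assumes the `prefect-dask` integration package;
`fasthep-flow`'s actual internals may differ.

```python
# Minimal sketch: a Prefect flow on a local Dask cluster.
# Assumes the prefect-dask integration package; fasthep-flow's
# actual internals may differ.
from prefect import flow, task
from prefect_dask import DaskTaskRunner


@task
def say_hello():
    print("Hello World!")


# DaskTaskRunner() without arguments starts a temporary local Dask cluster
@flow(task_runner=DaskTaskRunner())
def hello_world():
    say_hello.submit()  # submit() schedules the task on the Dask cluster


if __name__ == "__main__":
    hello_world()
```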

## Provenance

In a real-world scenario, you would want to keep track of the provenance of your
flow. This is done automatically by `fasthep-flow`, and you can find the
provenance in the `output/provenance` folder.

For more information, see [Provenance](../provenance.md).

So what does this look like for our hello world example?

```bash
tree output
```
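
The exact contents depend on your `fasthep-flow` version, but given the
`output/provenance` folder described above, the layout might look roughly like
this (hypothetical sketch):

```
output/
└── provenance/
    └── ...  # provenance records for the flow and its stages
```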

## Next steps

This was a very simple example, but it shows the basic concepts of
`fasthep-flow`. For more realistic, experiment-specific examples, see
[Examples](./index.md); for more advanced ones, see
[Advanced Examples](../advanced_examples/index.md).

11 changes: 11 additions & 0 deletions docs/examples/hello_world.yaml
@@ -0,0 +1,11 @@
stages:
  - name: "hello_world in bash"
    type: "fasthep_flow.operators.BashOperator"
    kwargs:
      bash_command: echo
      arguments: ["Hello World!"]
  - name: "touch /tmp/date.txt"
    type: "fasthep_flow.operators.BashOperator"
    kwargs:
      bash_command: touch
      arguments: ["/tmp/date.txt"]
1 change: 1 addition & 0 deletions docs/examples/index.md
@@ -4,6 +4,7 @@ Documentation is useful, but sometimes you just want to see some code. This
section contains some real-life examples of how to use `fasthep-flow`.

```{toctree}
+hello_world.md
atlas_example.md
cms_pub_example.md
dune_example.md
13 changes: 8 additions & 5 deletions docs/index.md
@@ -2,20 +2,22 @@

## Introduction

-`fasthep-flow` is a package for converting YAML files into an Apache Airflow
-DAG. It is designed to be used with the [fast-hep](https://fast-hep.github.io/)
-package ecosystem, but can be used independently.
+`fasthep-flow` is a package for converting YAML files into a
+[prefect](https://docs.prefect.io/latest/concepts/flows/) flow. It is designed
+to be used with the [fast-hep](https://fast-hep.github.io/) package ecosystem,
+but can be used independently.

The goal of this package is to define a workflow, e.g. a HEP analysis, in a YAML
-file, and then convert that YAML file into an Apache Airflow DAG. This DAG can
+file, and then convert that YAML file into a
+[prefect flow](https://docs.prefect.io/latest/concepts/flows/). This flow can
then be run on a local machine, or on a cluster using
[CERN's HTCondor](https://batchdocs.web.cern.ch/local/submit.html) (via Dask) or
[Google Cloud Composer](https://cloud.google.com/composer).

`fasthep-flow`'s YAML files draw inspiration from Continuous Integration (CI)
pipelines and Ansible Playbooks to define the workflow, where each step is a
task that can be run in parallel. `fasthep-flow` will check the parameters of
-each task, and then generate the DAG. The DAG will have a task for each step,
+each task, and then generate the flow. The flow will have a task for each step,
and the dependencies between the tasks will be defined by the `needs` key in the
YAML file. More on this under [Configuration](./configuration/index.md).
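
For illustration, a dependency between two stages might be expressed like this
(the exact schema is described under [Configuration](./configuration/index.md);
the stage names and commands here are hypothetical):

```yaml
stages:
  - name: "produce input"
    type: "fasthep_flow.operators.BashOperator"
    kwargs:
      bash_command: echo
      arguments: ["producing input"]
  - name: "process input"
    type: "fasthep_flow.operators.BashOperator"
    kwargs:
      bash_command: echo
      arguments: ["processing input"]
    needs: ["produce input"] # run only after "produce input" has finished
```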

@@ -39,6 +41,7 @@ concepts.md
configuration/index.md
operators.md
executors.md
+provenance.md
examples/index.md
advanced_examples/index.md
command_line_interface.md
49 changes: 49 additions & 0 deletions docs/provenance.md
@@ -0,0 +1,49 @@
# Provenance

## What is Provenance?

Provenance refers to the detailed history of the origin, lineage, and changes
made to data throughout its lifecycle. It encompasses the documentation of
processes, inputs, outputs, and transformations that data undergoes, providing a
comprehensive audit trail that can be used to verify the data's integrity and
authenticity.

## Why Does Provenance Matter for Scientific Data Analysis?

In scientific data analysis, provenance is crucial as it ensures the
reproducibility and reliability of results. It enables researchers to trace back
through the analysis workflow to understand how data was altered, what
computational steps were performed, and by whom. This traceability is essential
for validating research findings, facilitating peer reviews, and enabling other
researchers to replicate and build upon the work.

## Our Approach to Provenance in the YAML Config

To integrate provenance into our workflows, we introduce a dedicated provenance
section within the YAML configuration. This section describes which metadata
should be captured, e.g. the version of the dataset used, the origin of the
data, the specific parameters set for each analysis stage, and the individual
responsible for each step (taken from the git history). By embedding this
information directly into the workflow configuration, we ensure that every step
of data processing is transparent and traceable. This not only adheres to best
practices in scientific data handling but also empowers users to conduct robust
and transparent analyses.

### Example

```yaml
provenance:
  datasets:
    source: fasthep-curator # Specifies the tool used for dataset curation
  analysis:
    include:
      - steps # Enumerates the individual steps taken in the analysis
      - parameters # Parameters used at each step for reproducibility
      - git # Git commit hash, branch, and status for version control
      - performance # Metrics to measure the efficiency of the analysis
      - environment # Software environment, including library versions
      - hardware # Hardware specifications where the analysis was run
  airflow:
    include:
      - db # Database configurations and states within Airflow
```
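
Since `omegaconf` is among `fasthep-flow`'s dependencies, a section like the
one above can be inspected programmatically. A minimal sketch, assuming a
hypothetical `workflow.yaml` containing the provenance section shown above (the
loading logic inside `fasthep-flow` itself may differ):

```python
# Minimal sketch: inspecting the provenance section with OmegaConf.
# "workflow.yaml" is a hypothetical file containing the section above.
from omegaconf import OmegaConf

config = OmegaConf.load("workflow.yaml")
print(config.provenance.datasets.source)  # -> fasthep-curator
print(list(config.provenance.analysis.include))  # -> ['steps', 'parameters', ...]
```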
17 changes: 16 additions & 1 deletion mypy.ini
@@ -1,6 +1,7 @@
[mypy]
+plugins = pydantic.mypy
files = ["src", "tests"]
-python_version = 3.8
+python_version = 3.10
warn_unused_configs = true
strict = true
show_error_codes = true
@@ -13,6 +14,8 @@ disallow_untyped_decorators = false
[mypy-fasthep_flow.*]
disallow_untyped_defs = true
disallow_incomplete_defs = true
+implicit_reexport = true
+ignore_missing_imports = true

[mypy-typer.*]
implicit_reexport = true
@@ -21,3 +24,15 @@ ignore_missing_imports = true
[mypy-omegaconf.*]
ignore_missing_imports = true
implicit_reexport = true

+[mypy-plumbum.*]
+ignore_missing_imports = true
+implicit_reexport = true
+
+[mypy-hydra.*]
+ignore_missing_imports = true
+implicit_reexport = true
+
+[mypy-prefect.*]
+ignore_missing_imports = true
+implicit_reexport = true
33 changes: 8 additions & 25 deletions pyproject.toml
@@ -10,18 +10,16 @@ authors = [
]
description = "Convert YAML into a workflow DAG"
readme = "README.md"
requires-python = ">=3.8"
requires-python = ">=3.10"
classifiers = [
"Development Status :: 1 - Planning",
"Development Status :: 2 - Pre-Alpha",
"Intended Audience :: Science/Research",
"Intended Audience :: Developers",
"License :: OSI Approved :: Apache Software License",
"Operating System :: OS Independent",
"Programming Language :: Python",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3 :: Only",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
@@ -30,7 +28,10 @@ classifiers = [
]
dynamic = ["version"]
dependencies = [
"apache-airflow >=2.7",
"eval_type_backport",
"hydra-core",
"plumbum",
"prefect",
"pyyaml >=5.4",
"omegaconf >=2.1",
"typer >=0.4",
@@ -45,6 +46,7 @@ test = [
dev = [
"pytest >=6",
"pytest-cov >=3",
"ruff",
]
docs = [
"sphinx>=7.0",
@@ -93,25 +95,6 @@ report.exclude_lines = [
'if typing.TYPE_CHECKING:',
]

-# [tool.mypy]
-# files = ["src", "tests"]
-# python_version = "3.8"
-# warn_unused_configs = true
-# strict = true
-# show_error_codes = true
-# enable_error_code = ["ignore-without-code", "redundant-expr", "truthy-bool"]
-# warn_unreachable = true
-# disallow_untyped_defs = false
-# disallow_incomplete_defs = false
-
-# [[tool.mypy.overrides]]
-# module = "fasthep_flow.*"
-# disallow_untyped_defs = true
-# disallow_incomplete_defs = true
-
-# [[tool.mypy-typer.*]]
-# implicit_reexport = true


[tool.ruff]
select = [
@@ -160,7 +143,7 @@ isort.required-imports = ["from __future__ import annotations"]


[tool.pylint]
py-version = "3.8"
py-version = "3.10"
ignore-paths = [".*/_version.py"]
reports.output-format = "colorized"
similarities.ignore-imports = "yes"
4 changes: 3 additions & 1 deletion src/fasthep_flow/__init__.py
@@ -8,5 +8,7 @@
from __future__ import annotations

from ._version import version as __version__
+from .config import FlowConfig
+from .workflow import Workflow

-__all__ = ("__version__",)
+__all__ = ("__version__", "FlowConfig", "Workflow")