Integrating Airflow into tdp-collections-extra Stack: Open Discussion #85
gonzaloetjo asked this question in New components
Airflow PR Overview:
We're contemplating the inclusion of Airflow into the `extras` collection. Here is a general list of the components implemented (a minimal example DAG follows the list):

- Scheduler: Responsible for monitoring and ensuring the scheduled execution of all tasks. Decides where and when tasks are run.
- Webserver: A Flask server that serves the Airflow UI. Helps monitor, trigger, and debug DAGs.
- Broker: Facilitates communication between the Airflow scheduler and the workers, handling task messages and their status updates.
- Executor: Responsible for determining how tasks are run: in parallel or sequentially, locally or distributed.
- Flower: Used to monitor task progress and history with Celery executors.
- Workers: Execute the tasks. They pick up and run tasks sent to the queue by the executor.
- Database (metadata DB): Stores metadata about the state of tasks and workflows, and assists in recovery in case of failures.
- DAG directory: A folder of DAG files. It is read by the scheduler and executor, and has to exist on every worker as well.
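For concreteness, here is a minimal sketch of the kind of DAG file these components would process: the scheduler parses it from the DAG directory and hands task instances to the executor/workers. The `dag_id`, schedule, and command are placeholders, not anything from the PR:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A file like this, dropped into the DAG directory on every relevant
# host, is parsed by the scheduler and executed on the workers.
with DAG(
    dag_id="tdp_smoke_test",           # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",        # accepted on both 2.2.x and newer 2.x
    catchup=False,
) as dag:
    BashOperator(
        task_id="hello",
        bash_command="echo 'hello from an Airflow worker'",
    )
```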
Current work in the PR
The choice we make should primarily consider our specific needs within the `extra`, along with the stability and community support associated with the options.

Discussion Points:
Given Airflow's flexible implementation, this decision presents different discussion points:
Version Selection
There are two potential versions to consider:
Both can be left as options (through `tdp_vars`); this is the current implementation. However, we consider Airflow 2.2.5 to be quite limited in comparison to the newer versions. One of the major motivations for integrating later versions was the capability to impersonate the owner of the DAG when running tasks through different connectors (for instance, Hive tasks); a hedged sketch of what that looks like follows.
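As an illustration only: the import path assumes the `apache-airflow-providers-apache-hive` provider, `hive_cli_default` is that operator's default connection id, and the owner and table names are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.hive.operators.hive import HiveOperator

with DAG(
    dag_id="hive_impersonation_example",   # hypothetical
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
    default_args={"owner": "alice"},       # placeholder DAG owner
) as dag:
    # run_as_owner asks the Hive CLI hook to run the query as the DAG
    # owner instead of the Airflow service user.
    HiveOperator(
        task_id="count_rows",
        hql="SELECT COUNT(*) FROM some_db.some_table;",
        hive_cli_conn_id="hive_cli_default",
        run_as_owner=True,
    )
```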
Architectural Considerations

Within the `extras` collection, we have the option of installing these workers either on the cluster workers or on the edge node, where the majority of our clients exist.

Multi-Tenancy and Impersonation
As of now, Airflow doesn't inherently support multi-tenancy, with this feature possibly becoming available in a year or more (apache/airflow#29986). Considering this:
Users can impersonate other users through the `bash operator` (see the sketch below). However, they can't impersonate other users through the Hive, Spark, and HDFS operators, which means we either disallow the use of the BashOperator or DAGs have to be validated (`airflow >= 2.5`).
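A minimal sketch of what that BashOperator impersonation looks like, assuming the worker's airflow user is allowed to sudo to the target account (`alice` is a placeholder):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="bash_impersonation_example",   # hypothetical
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # run_as_user makes the task runner execute the command via sudo as
    # the given user, so the command runs under that account.
    BashOperator(
        task_id="whoami",
        bash_command="whoami",
        run_as_user="alice",
    )
```

Since `run_as_user` can target any account the airflow user may sudo to, this is what motivates either disallowing the BashOperator or validating DAGs.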
DAG distribution
During Airflow installation through the collections, we update DAGs across all relevant hosts. This isn't a suitable solution for DAG editors, as TDP is meant either for installation or, potentially, production maintenance. I don't think there's much for TOSIT-IO to do here, but it's good to be aware of it.
Extra tools
There is quite a lot of extra tooling in the draft PR, most of it mentioned in the previous topics. If we decide NOT to keep it as core to the deployment, we could keep some of it in `airflow.utils` in case users want to use it. These would be:

The decision should consider the specific requirements for the extra, stability, and user support for each option.
Input and suggestions for an effective integration are welcome.