HA Scheduler does not respect max_active_runs in edge case #45388
Labels
affected_version:2.10
area:core
area:Scheduler
kind:bug
Apache Airflow version
2.10.4
If "Other Airflow 2 version" selected, which one?
No response
What happened?
Two queued dagruns of a DAG with max_active_runs of 1 started within 0.2 seconds of each other.
The deployment has two schedulers, A and B. I suspect scheduler A started one dagrun and scheduler B started the other. Each scheduler queries the active dagrun information in every scheduling loop via _start_queued_dagruns, and that information is not shared between schedulers, so the limit can be exceeded: both schedulers saw no active dagruns and each started its own queued dagrun.
The open question is how one dagrun ends up in one scheduler's query and not the other's. One explanation is that scheduler A locked only one dagrun row because the other row fell outside the max_dagruns_per_loop_to_schedule range, and scheduler B then picked up the row that A did not query. This is very unlikely, but I suspect both scheduling loops ran the query at almost the same time.
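To make the suspected interleaving concrete, here is a minimal sketch in plain Python. It is not Airflow code: `DagRun`, `select_queued`, `count_active`, and `simulate` are hypothetical names, and the sketch assumes that a scheduler skips rows already locked by the other scheduler and that both schedulers read the active count before either one commits.

```python
from dataclasses import dataclass


@dataclass
class DagRun:
    run_id: str
    state: str = "queued"


def select_queued(runs, locked, limit):
    """Pick up to `limit` queued runs, skipping rows another scheduler holds.

    Assumption: this stands in for a row-locking query that skips locked rows.
    """
    picked = [r for r in runs if r.state == "queued" and r.run_id not in locked][:limit]
    locked.update(r.run_id for r in picked)
    return picked


def count_active(runs):
    """Number of runs currently in the running state."""
    return sum(r.state == "running" for r in runs)


def simulate(max_active_runs=1, per_loop_limit=1):
    """Replay the hypothesised interleaving of schedulers A and B."""
    runs = [DagRun("run_1"), DagRun("run_2")]
    locked = set()

    # Step 1: both schedulers select and lock their queued rows.
    picked_a = select_queued(runs, locked, per_loop_limit)
    picked_b = select_queued(runs, locked, per_loop_limit)

    # Step 2: both read the active count before either has committed.
    seen_by_a = count_active(runs)
    seen_by_b = count_active(runs)

    # Step 3: each scheduler starts its runs based on its own stale count.
    for picked, seen in ((picked_a, seen_by_a), (picked_b, seen_by_b)):
        for run in picked:
            if seen < max_active_runs:
                run.state = "running"
                seen += 1

    return count_active(runs)


if __name__ == "__main__":
    print(simulate(per_loop_limit=1))  # each scheduler locks a different row
    print(simulate(per_loop_limit=2))  # scheduler A locks both rows
```

Under these assumptions, a per-loop limit of 1 leaves both runs running despite max_active_runs of 1, while a limit of 2 lets scheduler A see both rows and enforce the limit on its own.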
What you think should happen instead?
No response
How to reproduce
This scenario requires extreme luck (or lack thereof), so I have not been able to reproduce this behaviour.
Perhaps the key to reproducing it is setting max_dagruns_per_loop_to_schedule to 1.
Operating System
n/a
Versions of Apache Airflow Providers
No response
Deployment
Astronomer
Deployment details
No response
Anything else?
No response
Are you willing to submit PR?
Code of Conduct