
HA Scheduler does not respect max_active_runs in edge case #45388

Open
1 of 2 tasks
wolfier opened this issue Jan 3, 2025 · 4 comments
Labels
affected_version:2.10 Issues Reported for 2.10 area:core area:Scheduler including HA (high availability) scheduler kind:bug This is a clearly a bug

Comments

@wolfier
Contributor

wolfier commented Jan 3, 2025

Apache Airflow version

2.10.4

If "Other Airflow 2 version" selected, which one?

No response

What happened?

Two queued dagruns of a DAG with max_active_runs set to 1 were started within 0.2 seconds of each other.

The deployment has two schedulers, A and B. I suspect scheduler A started one dagrun and scheduler B started the other. Because each scheduler queries the active dagrun information independently in every scheduling loop via _start_queued_dagruns, the limit can be exceeded, as that information is not shared between schedulers. Both schedulers thought there were no active dagruns and each started its own queued dagrun.

The remaining question is how one dagrun ends up in one scheduler's query but not the other's. One explanation is that scheduler A locked only 1 dagrun row because the other dagrun row fell outside the max_dagruns_per_loop_to_schedule range; the other scheduler then picked up the dagrun that did not get queried. Even though it is very unlikely, I suspect both scheduling loops ran the query at nearly the same time.
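To make the suspected interleaving concrete, here is a minimal model of the check-then-act race between two scheduler loops (plain variables stand in for the metadata DB; this is illustrative, not Airflow internals):

```python
# Hypothetical sketch: both schedulers read the count before either writes.
max_active_runs = 1
active_runs = 0  # shared state both schedulers read from the DB

# Step 1: each scheduler reads the active-run count in its own loop.
seen_by_a = active_runs  # scheduler A sees 0
seen_by_b = active_runs  # scheduler B sees 0

# Step 2: each concludes independently that it may start a run.
if seen_by_a < max_active_runs:
    active_runs += 1  # A starts its queued dagrun
if seen_by_b < max_active_runs:
    active_runs += 1  # B starts the other queued dagrun

print(active_runs)  # 2 - max_active_runs of 1 is exceeded
```

The race only manifests when the second read lands between the first read and the first write, which is why the two loops would have to run nearly simultaneously.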

What you think should happen instead?

No response

How to reproduce

This scenario requires extreme luck (or lack thereof) so I have not been able to reproduce this behaviour.

Perhaps the key to reproducing this is setting max_dagruns_per_loop_to_schedule to 1.

Operating System

n/a

Versions of Apache Airflow Providers

No response

Deployment

Astronomer

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@wolfier wolfier added area:core kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Jan 3, 2025
@dosubot dosubot bot added the area:Scheduler including HA (high availability) scheduler label Jan 3, 2025
@potiuk
Member

potiuk commented Jan 4, 2025

Yes. That is quite a possible edge case, though rather infrequent. I am not sure there is a way we could protect against it - we would likely have to add a lock on the DAG, not only on the DagRun, but that would likely heavily reduce parallelism in some scheduling scenarios. @ashb wdyt?

@jedcunningham
Member

As I was chatting with Alan about this yesterday, another thought I had was to do optimistic locking, e.g. set the state to running only if the count of running dagruns for that DAG matches what we expected (and used to determine it was okay to start another!).
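A minimal sketch of that idea, using SQLite and illustrative table/column names (not Airflow's actual schema): the running count is re-checked inside the UPDATE itself, so the check and the state change happen in one atomic statement and a concurrent scheduler's update simply affects zero rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dag_run (id INTEGER PRIMARY KEY, dag_id TEXT, state TEXT)"
)
conn.executemany(
    "INSERT INTO dag_run VALUES (?, ?, ?)",
    [(1, "my_dag", "queued"), (2, "my_dag", "queued")],
)

MAX_ACTIVE_RUNS = 1

def try_start(run_id: int) -> bool:
    """Flip queued -> running only while the limit still holds."""
    cur = conn.execute(
        """
        UPDATE dag_run SET state = 'running'
        WHERE id = ? AND state = 'queued'
          AND (SELECT COUNT(*) FROM dag_run
               WHERE dag_id = 'my_dag' AND state = 'running') < ?
        """,
        (run_id, MAX_ACTIVE_RUNS),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows updated => precondition failed

print(try_start(1))  # True  - first dagrun starts
print(try_start(2))  # False - limit reached, update is a no-op
```

The caller detects the lost race from the affected-row count rather than by re-querying, which is the "potentially failing check" shape discussed below.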

@potiuk
Member

potiuk commented Jan 4, 2025

Can we do it? How?

I think this is the "logical" problem to solve.

Whatever optimistic query you run is a "read", and to make it really protected you need to make it part of a transaction - you need to query and update the state in the same transaction, essentially.

Or you have to read a "state" of the system and use it in the "write" statement to detect whether the state has changed. You cannot really separate the "read" and "update" actions - they must be connected either by a transaction, or by using the state retrieved in the "read" as a potentially failing check in the update.

@potiuk
Member

potiuk commented Jan 4, 2025

Generally:

a) pessimistic locking

  • lock update
  • run query and update state
  • save the state change and unlock

b) optimistic locking

  • read a state
  • run the query and update only if the state has not changed, in a single transaction, committing in the same operation (and failing otherwise)

I am not sure what "state" we can read here and how we can verify the state has not changed since we read it.
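For completeness, option (a) can be sketched with a per-DAG lock held across both the read and the write, so the two steps cannot interleave. Here threading.Lock stands in for a database row lock; the names are illustrative, not Airflow code.

```python
import threading

max_active_runs = 1
active_runs = 0
dag_lock = threading.Lock()  # stands in for a per-DAG DB lock
results = []

def try_start():
    global active_runs
    with dag_lock:                          # lock
        if active_runs < max_active_runs:   # run query
            active_runs += 1                # update state
            results.append(True)
        else:
            results.append(False)
        # lock released on exit: "save the state change and unlock"

threads = [threading.Thread(target=try_start) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results), active_runs)  # [False, True] 1
```

Exactly one "scheduler" wins regardless of interleaving, which is the correctness property wanted here; the cost, as noted above, is that all starts for a DAG are serialized through that lock.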

@eladkal eladkal added affected_version:2.10 Issues Reported for 2.10 and removed needs-triage label for new issues that we didn't triage yet labels Jan 6, 2025