
HA Scheduler does not respect max_active_runs in edge case #45388

Open
1 of 2 tasks
wolfier opened this issue Jan 3, 2025 · 4 comments
Labels
affected_version:2.10 Issues Reported for 2.10 area:core area:Scheduler including HA (high availability) scheduler kind:bug This is a clearly a bug

Comments

@wolfier
Contributor

wolfier commented Jan 3, 2025

Apache Airflow version

2.10.4

If "Other Airflow 2 version" selected, which one?

No response

What happened?

Two queued dagruns of a DAG with max_active_runs set to 1 were started within 0.2 seconds of each other.

The deployment has two schedulers, A and B. I suspect scheduler A started one dagrun and scheduler B started the other. Because each scheduler queries the active dagrun information independently in every scheduling loop via _start_queued_dagruns, the limit can be exceeded, as that information is not shared between schedulers. Both schedulers thought there were no active dagruns and each started its own queued dagrun.

The remaining question is how one dagrun ends up in one scheduler's query but not the other's. One explanation is that scheduler A locked only 1 dagrun row because the other dagrun row fell outside the max_dagruns_per_loop_to_schedule range; the other scheduler then picked up the dagrun that did not get queried. Even though it is very unlikely, I suspect both scheduling loops ran the query at nearly the same time.
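To make the suspected interleaving concrete, here is a minimal model of the check-then-act race between two scheduler loops (plain variables stand in for the metadata DB; this is illustrative, not Airflow internals):

```python
# Hypothetical sketch: both schedulers read the count before either writes.
max_active_runs = 1
active_runs = 0  # shared state both schedulers read from the DB

# Step 1: each scheduler reads the active-run count in its own loop.
seen_by_a = active_runs  # scheduler A sees 0
seen_by_b = active_runs  # scheduler B sees 0

# Step 2: each concludes independently that it may start a run.
if seen_by_a < max_active_runs:
    active_runs += 1  # A starts its queued dagrun
if seen_by_b < max_active_runs:
    active_runs += 1  # B starts the other queued dagrun

print(active_runs)  # 2 - max_active_runs of 1 is exceeded
```

The race only manifests when the second read lands between the first read and the first write, which is why the two loops would have to run nearly simultaneously.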

What you think should happen instead?

No response

How to reproduce

This scenario requires extreme luck (or lack thereof) so I have not been able to reproduce this behaviour.

Perhaps the key to reproducing this is setting max_dagruns_per_loop_to_schedule to 1.

Operating System

n/a

Versions of Apache Airflow Providers

No response

Deployment

Astronomer

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@wolfier wolfier added area:core kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Jan 3, 2025
@dosubot dosubot bot added the area:Scheduler including HA (high availability) scheduler label Jan 3, 2025
@potiuk
Member

potiuk commented Jan 4, 2025

Yes. That is quite a possible edge case, though rather infrequent. I am not sure there is a way we could protect against it - we would likely have to add a lock on the DAG, not only on the DagRun, but that would likely heavily reduce parallelism in some scheduling scenarios. @ashb wdyt?

@jedcunningham
Member

As I was chatting with Alan about this yesterday, another thought I had was to do optimistic locking, e.g. set the state to running only if the count of running dagruns for that DAG matches what we expected (and used to determine it was okay to start another!).
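A minimal sketch of that idea, using SQLite and illustrative table/column names (not Airflow's actual schema): the running count is re-checked inside the UPDATE itself, so the check and the state change happen in one atomic statement and a concurrent scheduler's update simply affects zero rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dag_run (id INTEGER PRIMARY KEY, dag_id TEXT, state TEXT)"
)
conn.executemany(
    "INSERT INTO dag_run VALUES (?, ?, ?)",
    [(1, "my_dag", "queued"), (2, "my_dag", "queued")],
)

MAX_ACTIVE_RUNS = 1

def try_start(run_id: int) -> bool:
    """Flip queued -> running only while the limit still holds."""
    cur = conn.execute(
        """
        UPDATE dag_run SET state = 'running'
        WHERE id = ? AND state = 'queued'
          AND (SELECT COUNT(*) FROM dag_run
               WHERE dag_id = 'my_dag' AND state = 'running') < ?
        """,
        (run_id, MAX_ACTIVE_RUNS),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows updated => precondition failed

print(try_start(1))  # True  - first dagrun starts
print(try_start(2))  # False - limit reached, update is a no-op
```

The caller detects the lost race from the affected-row count rather than by re-querying, which is the "potentially failing check" shape discussed below.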

@potiuk
Member

potiuk commented Jan 4, 2025

Can we do it? How?

I think this is the "logical" problem to solve.

Whatever optimistic query you run is a "read", and to make it really protected you need to make it part of a transaction - you need to query and update the state in the same transaction, essentially.

Or you have to read a "state" of the system and use it in the "write" statement to detect whether the state has changed. You cannot really separate the "read" and "update" actions - they must be connected either by a transaction, or by using the state retrieved in the "read" as a potentially failing check in the update.

@potiuk
Member

potiuk commented Jan 4, 2025

Generally:

a) pessimistic locking

  • lock update
  • run query and update state
  • save the state change and unlock

b) optimistic locking

  • read a state
  • run the query and update only if the state has not changed, in a single transaction, committing in the same operation (and failing otherwise)

I am not sure what "state" we can read here and how we can verify the state has not changed since we read it.
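For completeness, option (a) can be sketched with a per-DAG lock held across both the read and the write, so the two steps cannot interleave. Here threading.Lock stands in for a database row lock; the names are illustrative, not Airflow code.

```python
import threading

max_active_runs = 1
active_runs = 0
dag_lock = threading.Lock()  # stands in for a per-DAG DB lock
results = []

def try_start():
    global active_runs
    with dag_lock:                          # lock
        if active_runs < max_active_runs:   # run query
            active_runs += 1                # update state
            results.append(True)
        else:
            results.append(False)
        # lock released on exit: "save the state change and unlock"

threads = [threading.Thread(target=try_start) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results), active_runs)  # [False, True] 1
```

Exactly one "scheduler" wins regardless of interleaving, which is the correctness property wanted here; the cost, as noted above, is that all starts for a DAG are serialized through that lock.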

@eladkal eladkal added affected_version:2.10 Issues Reported for 2.10 and removed needs-triage label for new issues that we didn't triage yet labels Jan 6, 2025