draft of limiting number of concurrent step function executions #1365
FilterExpressions are applied after a Query is evaluated; see https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.html#Query.Limit. For example:
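A minimal sketch in pure Python (simulating DynamoDB's evaluation order, not calling the real API) of why `Limit` caps the items *read*, while the `FilterExpression` is only applied afterwards:

```python
# Simulate DynamoDB Query semantics: Limit is applied first, to the number
# of items evaluated; the FilterExpression then runs over that page, so the
# result may contain far fewer matching items than Limit.
def query_with_filter(items, limit, predicate):
    page = items[:limit]                       # Limit: items evaluated
    return [i for i in page if predicate(i)]   # FilterExpression: applied after

jobs = [{'execution_started': True}] * 5 + [{'execution_started': False}] * 5

# Ask for 3 items where execution_started is False: the first page of 3
# contains only started jobs, so the filtered result is empty.
result = query_with_filter(jobs, 3, lambda j: not j['execution_started'])
print(result)  # []
```

This is why scanning past many `execution_started=True` items slows the query down even though they never appear in the results.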
We already have a natural limit on the number of concurrent step function executions: `get_jobs_waiting_for_execution` has to scan through more and more jobs where `execution_started=True` before it finds 900 where `execution_started=False`. Eventually the StartExecutionManager lambda hits its 10 s timeout while the query is running, and no new step function executions are started. Anecdotally, the tipping point is in the hundreds of thousands of executions.

This PR makes that limit visible and lower. This change should have no impact on throughput: as in-progress executions complete, new executions will be submitted at the same rate, up to our maximum of 900 per minute.
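A minimal sketch of the submission cap described above. The names (`MAX_CONCURRENT`, `submit_pending`) and the limit value are illustrative assumptions, not HyP3's actual API or configuration:

```python
MAX_CONCURRENT = 10_000  # illustrative limit, not HyP3's actual value


def submit_pending(pending_jobs, running_count, start_execution):
    """Start at most (MAX_CONCURRENT - running_count) new executions.

    pending_jobs: jobs waiting for execution, in whatever order the query
    returned them (note: not ordered by priority).
    """
    slots = max(0, MAX_CONCURRENT - running_count)
    for job in pending_jobs[:slots]:
        start_execution(job)
    return min(len(pending_jobs), slots)
```

Because submissions are throttled only when `running_count` is at the cap, steady-state throughput is unchanged: each completed execution frees a slot for the next pending job.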
Pros:
As more and more step functions run concurrently, we see more and more `batch.DescribeJob` calls as each of those step functions polls for the status of its AWS Batch job. This generates overhead costs from AWS GuardDuty in JPL/EDC environments. It also leads to `TooManyRequestsException` errors for other individuals/apps making `batch.DescribeJob` calls. Reducing the number of concurrent executions reduces the impact of both issues.

A lower concurrency limit also better positions us to pause major processing campaigns and release code fixes if/when needed, since fewer jobs will be "committed" as a running step function execution.
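A back-of-envelope sketch of how the polling load scales. The poll interval is an assumed illustrative value, not HyP3's actual configuration:

```python
# DescribeJob call rate scales linearly with the number of concurrently
# running step function executions, since each one polls its Batch job.
def describe_job_calls_per_second(concurrent_executions, poll_interval_s):
    return concurrent_executions / poll_interval_s

# At 50,000 concurrent executions and an assumed 30 s poll interval:
print(describe_job_calls_per_second(50_000, 30))  # ≈ 1667 calls/s
```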
Cons:
This impacts our job priority scheme if/when the concurrency limit is reached. Job priority is implemented at the Batch job level, and StartExecution processes jobs independently of priority (I'm not even sure it's first-come, first-served). When HyP3 is at the concurrency limit, new high-priority jobs would have to wait in line until StartExecution processes them, at which point they'd jump to the front of the 50,000 in-progress executions.
HyP3 production is at little risk of hitting any concurrency limit, since we typically process < 100,000 jobs per month.