draft of limiting number of concurrent step function executions #1365

Closed
wants to merge 3 commits into from

@asjohnston-asf (Member) commented Dec 14, 2022

FilterExpressions are applied after a Query is evaluated; see https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Query.html#Query.Limit. In the example below, the query reads 766 items (ScannedCount) but the filter discards all of them (Count is 0):

>>> response = table.query(
...     IndexName='status_code',
...     KeyConditionExpression=Key('status_code').eq('SUCCEEDED'),
...     FilterExpression=Attr('execution_started').eq(False),
... )
>>> pprint.pprint(response)
{'Count': 0,
 'Items': [],
 'LastEvaluatedKey': {'job_id': 'ebeb0696-cab2-4cdd-8d80-be6b2815bb80',
                      'status_code': 'SUCCEEDED'},
 'ResponseMetadata': {'HTTPHeaders': {'connection': 'keep-alive',
                                      'content-length': '148',
                                      'content-type': 'application/x-amz-json-1.0',
                                      'date': 'Wed, 14 Dec 2022 21:02:34 GMT',
                                      'server': 'Server',
                                      'x-amz-crc32': '3368818628',
                                      'x-amzn-requestid': 'KSAE99641TFOAUKOIL7PC9LJEJVV4KQNSO5AEMVJF66Q9ASUAAJG'},
                      'HTTPStatusCode': 200,
                      'RequestId': 'KSAE99641TFOAUKOIL7PC9LJEJVV4KQNSO5AEMVJF66Q9ASUAAJG',
                      'RetryAttempts': 0},
 'ScannedCount': 766}

We already have an implicit limit on the number of concurrent step function executions: get_jobs_waiting_for_execution has to scan through more and more jobs where execution_started=True before it finds 900 where execution_started=False. Eventually the StartExecutionManager lambda hits its 10-second timeout while the query is still running, and no new step function executions are started. Anecdotally, the tipping point is in the hundreds of thousands of executions.
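To make the scanning cost concrete, here is a minimal sketch of paging through a filtered query until enough matching items are found. It is not HyP3's actual get_jobs_waiting_for_execution: query_until and FakeTable are hypothetical names, and FakeTable is an in-memory stand-in (with a made-up 'index' key) so the sketch runs without AWS. The response fields (Items, ScannedCount, LastEvaluatedKey) and the ExclusiveStartKey parameter are the ones a real DynamoDB Query uses.

```python
def query_until(table, limit, **query_kwargs):
    """Page through table.query() until `limit` filtered items are collected.

    Returns the items plus the total number of items DynamoDB had to read.
    """
    items, scanned = [], 0
    while True:
        response = table.query(**query_kwargs)
        items.extend(response['Items'])
        scanned += response['ScannedCount']  # counts items read before filtering
        last_key = response.get('LastEvaluatedKey')
        if len(items) >= limit or last_key is None:
            return items[:limit], scanned
        query_kwargs['ExclusiveStartKey'] = last_key  # resume on the next page


class FakeTable:
    """Hypothetical in-memory stand-in for a DynamoDB table, for illustration only."""

    def __init__(self, items, page_size):
        self.items = items
        self.page_size = page_size

    def query(self, ExclusiveStartKey=None, **_ignored):
        start = 0 if ExclusiveStartKey is None else ExclusiveStartKey['index'] + 1
        page = self.items[start:start + self.page_size]
        # The filter runs after the page is read, as with a FilterExpression.
        matched = [item for item in page if not item['execution_started']]
        response = {'Items': matched, 'Count': len(matched), 'ScannedCount': len(page)}
        end = start + len(page)
        if end < len(self.items):
            response['LastEvaluatedKey'] = {'index': end - 1}
        return response


# 1000 already-started jobs sit in front of 5 jobs that are still waiting.
jobs = [{'execution_started': True}] * 1000 + [{'execution_started': False}] * 5
found, scanned = query_until(FakeTable(jobs, page_size=100), limit=3)
print(len(found), scanned)  # prints: 3 1005
```

Finding 3 waiting jobs required reading all 1005 items across 11 pages, which is the growth in query time described above.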

This PR makes that limit visible and lower. This change should have no impact on throughput: as in-progress executions complete, new executions will be submitted at the same rate, up to our maximum of 900 per minute.
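As a rough illustration of an explicit cap (a sketch only, not necessarily how this PR implements it), the number of executions to start on each invocation can be bounded by both the per-minute batch size and the remaining headroom under the limit. The function name and the source of the running count are assumptions; 900 is the per-minute maximum mentioned above, and 50,000 is the in-progress figure used in the discussion further down.

```python
def executions_to_start(running, concurrency_limit=50_000, batch_size=900):
    """Return how many new executions may be started right now.

    `running` is the current number of in-progress executions; how it is
    obtained is left open here (hypothetical input).
    """
    headroom = max(concurrency_limit - running, 0)
    return min(batch_size, headroom)


print(executions_to_start(0))       # well below the limit: a full batch
print(executions_to_start(49_500))  # near the limit: only the remaining headroom
print(executions_to_start(50_000))  # at the limit: nothing is started
```

Below the limit the full batch is always submitted, which is why throughput is unchanged until the cap is actually reached.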

Pros:

As more step functions run concurrently, we see more batch.DescribeJobs calls, since each step function polls for the status of its AWS Batch job. This generates overhead costs from AWS GuardDuty in JPL/EDC environments. It also leads to TooManyRequestsException errors for other individuals/apps making batch.DescribeJobs calls. Reducing the number of concurrent executions reduces the impact of both issues.

A lower concurrency limit better positions us to pause major processing campaigns and release code fixes if/when needed, since fewer jobs will be "committed" as a running step function execution.
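The polling overhead in the first point scales linearly with concurrency, which a back-of-the-envelope sketch makes visible. The function name is hypothetical and the 60-second polling interval is an assumption chosen purely for illustration; the actual interval used by the step function is not stated here.

```python
def status_polls_per_second(concurrent_executions, poll_interval_seconds):
    """Estimate the aggregate rate of Batch status-polling calls.

    Assumes each running execution makes one call per polling interval.
    """
    return concurrent_executions / poll_interval_seconds


# Assumed 60-second interval (illustrative only).
for executions in (1_000, 50_000, 300_000):
    rate = status_polls_per_second(executions, poll_interval_seconds=60)
    print(f'{executions} executions -> {rate:.1f} calls/s')
```

Whatever the real interval is, halving the number of concurrent executions halves the call rate.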

Cons:

This impacts our job priority scheme if/when the concurrency limit is reached. Job priority is implemented at the Batch job level. StartExecution processes jobs independent of priority (I'm not even sure it's first-come, first-served). When HyP3 is at the concurrency limit, new high-priority jobs would have to wait in line until StartExecution processes them, at which point they'd jump to the front of the 50,000 in-progress executions.

That said, HyP3 production is at little risk of hitting any concurrency limit, since we typically process fewer than 100,000 jobs per month.

@asjohnston-asf
Member Author

After discussing the implementation details with @jtherrmann: the business logic behind this implementation is very inside-baseball and relies on DynamoDB returning query results in a consistent order. We'll put more thought into alternatives before trying to push a change through.

@jtherrmann
Contributor

We may want to move the broader discussion to #1272.

@jhkennedy deleted the scanned-count branch December 11, 2023 19:52
2 participants