Add/lifecycle heartbeat #1116
Regarding the potential improvements: I think caching describeLifecycleHook is unnecessary because, like you said, termination events are not that frequent, so it would only add complexity. I think a single heartbeat manager would also add unnecessary complexity.
Thanks for doing all the thorough manual testing! Can we also test with Windows nodes?
for {
	select {
	case <-stopCh:
What's the purpose of stopCh that isn't already covered by timeout?
Oh, actually it makes sense. I forgot it's being closed by the post-drain task.
### Important Notes

- A lifecycle hook for instance termination is required for this feature. Longer grace periods are achieved by renewing the heartbeat timeout of the ASG's lifecycle hook. Instances terminate instantly without a hook.
I think this section is important enough that it could be described in the How to use section? The How to use section currently seems sparse.
ASG_TERMINATE_EVENT_ONE_LINE=$(echo "${ASG_TERMINATE_EVENT}" | tr -d '\n' | sed 's/\"/\\"/g')
SEND_SQS_CMD="awslocal sqs send-message --queue-url ${queue_url} --message-body \"${ASG_TERMINATE_EVENT_ONE_LINE}\" --region ${AWS_REGION}"
kubectl exec -i "${localstack_pod}" -- bash -c "$SEND_SQS_CMD"
echo "✅ Sent Spot Interruption Event to SQS queue: ${queue_url}"
nit: ASG termination event
if [[ $FOUND_HEARTBEAT_END_LOG -eq 0 ]] && kubectl logs -n kube-system "${NTH_POD}" | grep -q "Heartbeat deadline exceeded, stopping heartbeat"; then
    FOUND_HEARTBEAT_END_LOG=1
fi
if [[ $HEARTBEAT_COUNT -eq 3 && $FOUND_HEARTBEAT_END_LOG -eq 1 ]]; then
nit: Instead of hardcoding these values, can we derive the expected number of heartbeats from HEARTBEAT_INTERVAL and HEARTBEAT_UNTIL at the top of the file? Just to make it easier to update this test file if need be.
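The arithmetic behind this suggestion can be sketched as follows. This is an illustration in Go rather than a patch to the shell test, and it rests on two assumptions: one heartbeat fires per full interval with none at time zero, and heartbeats stop at the deadline. `expectedHeartbeats` and the sample values are hypothetical and not taken from the test file.

```go
package main

import "fmt"

// expectedHeartbeats is a hypothetical helper. It returns how many interval
// ticks fall at or before the deadline, assuming no heartbeat at time zero.
func expectedHeartbeats(intervalSeconds, untilSeconds int) int {
	if intervalSeconds <= 0 || untilSeconds <= 0 {
		return 0
	}
	return untilSeconds / intervalSeconds
}

func main() {
	// Illustrative values only: ticks at 3s, 6s and 9s fit before a 10s deadline.
	fmt.Println(expectedHeartbeats(3, 10)) // prints 3
}
```

If the deadline is an exact multiple of the interval, the final tick and the deadline coincide, so the observed count may be one lower than this estimate. Choosing values that do not divide evenly avoids that ambiguity in a test.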
if [[ $cordoned -eq 1 && $(kubectl get deployments regular-pod-test -o=jsonpath='{.status.unavailableReplicas}') -eq 1 ]]; then
    echo "✅ Verified the regular-pod-test pod was evicted!"
    echo "✅ ASG Lifecycle SQS Test Passed $CLUSTER_NAME! ✅"
nit: "ASG Lifecycle SQS Test Passed with Heartbeat"
Just to avoid ambiguity with other tests that we run?
heartbeatTimeout := int(*lifecyclehook.LifecycleHooks[0].HeartbeatTimeout)

if heartbeatInterval >= heartbeatTimeout {
	log.Warn().Msgf("Heartbeat interval (%d seconds) is equal to or greater than the heartbeat timeout (%d seconds) for the lifecycle hook %s. The node would likely be terminated before the heartbeat is sent", heartbeatInterval, heartbeatTimeout, *lifecyclehook.LifecycleHooks[0].LifecycleHookName)
Can we also add the ASG name (lifecycleDetail.AutoScalingGroupName) to this warning log? Just to help with debugging.
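One possible shape for the message, as a sketch only: `intervalWarning` is a hypothetical helper that builds the string with `fmt.Sprintf` so the example runs without the project's logger, and the hook and ASG names in `main` are placeholders. The wording of the added clause is an assumption, not the text that was merged.

```go
package main

import "fmt"

// intervalWarning is a hypothetical helper. It returns a warning message and
// true when the heartbeat interval is not shorter than the lifecycle hook's
// heartbeat timeout, and an empty string and false otherwise.
func intervalWarning(interval, timeout int, hookName, asgName string) (string, bool) {
	if interval < timeout {
		return "", false
	}
	msg := fmt.Sprintf(
		"Heartbeat interval (%d seconds) is equal to or greater than the heartbeat timeout (%d seconds) "+
			"for the lifecycle hook %s attached to the Auto Scaling group %s. "+
			"The node would likely be terminated before the heartbeat is sent",
		interval, timeout, hookName, asgName)
	return msg, true
}

func main() {
	// Placeholder names for illustration only.
	if msg, warn := intervalWarning(300, 300, "example-hook", "example-asg"); warn {
		fmt.Println(msg)
	}
}
```

Including the group name lets someone reading the logs go straight to the affected ASG when several groups share a hook name.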
Issue #, if available:
#493 NTH should issue lifecycle heartbeats
Description of changes:
How you tested your changes:
Environment (Linux / Windows): Linux
Kubernetes Version: v1.31.3
Unit testing:
E2E testing:
Manual testing with real ASG and K8s:
Potential Improvements:
- Cached API call for getting the heartbeat timeout from the ASG (describeLifecycleHook).
- Single heartbeat manager (a single goroutine for issuing heartbeats).
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.