Add/lifecycle heartbeat #1116

Open. Wants to merge 19 commits into main.

@hyeong01 hyeong01 commented Jan 10, 2025

Issue #, if available:
#493 NTH should issue lifecycle heartbeats

Description of changes:

  1. Added the option to issue heartbeats
     • Configurable interval and length
  2. Added heartbeat configuration validity check
  3. Added automatic closure when lifecycle action becomes invalid
  4. Added unit and E2E testing
  5. Added explanation of the feature to readme

How you tested your changes:
Environment (Linux / Windows): Linux
Kubernetes Version: v1.31.3

Unit testing:

  • Checked heartbeat signals and closure due to heartbeat expiration
  • Checked heartbeat signals and closure due to drain completion
  • Checked heartbeat failures due to invalid lifecycle hook target
  • Checked heartbeat failures due to causes other than an invalid lifecycle hook target

E2E testing:

  • E2E tested heartbeat signal and closure due to heartbeat expiration with kind and localstack.

Manual testing with real ASG and K8s:

  • Tested all 32 possible configuration cases
  • Tested interval > timeout case
  • Graceful termination 5->3, 5->4, 10->3, 103->3 (corresponding number of workers for terminating instances) with varying intervals and length for each.
  • Important example 1: # of terminations=103->3, Interval=300, heartbeatUntil=590, timeout=590. The last interruption event entered processInterruptionEvent 223 seconds after the first event was stored in the eventStore. The first heartbeat was sent 90 seconds after the last interruption event entered.
  • Important example 2: # of terminations=103->3, Interval=90, heartbeatUntil=260, timeout=150. The timeout was not enough for the last heartbeat to be sent out.
  • Ungraceful termination 5->3, 10->3, 20->3, 103->3 (smaller number of workers than terminating instances).

Potential Improvements:
Cached API call for getting heartbeat timeout from ASG (describeLifecycleHook).

  • Details: 1 hour cache for retrieving the heartbeat timeout value and warning user if timeout < interval.
  • Pros: Reduced number of API calls.
  • Cons: Added system complexity. The cache would often expire or go unused because termination events happen sparsely.

Single heartbeat manager (single goroutine for issuing heartbeats)

  • Pros: Reduced number of goroutines
  • Cons: Potential burst in API calls. Goroutines are lightweight, and their number does not grow beyond the number of workers (default 10).

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@hyeong01 hyeong01 requested a review from a team as a code owner January 10, 2025 03:47

Lu-David commented Jan 10, 2025

Regarding potential improvements: I think caching describeLifecycleHook is unnecessary because, like you said, termination events are not that frequent, and it would add unnecessary complexity. Regarding the single heartbeat manager, I think it also adds unnecessary complexity.

Thanks for doing all the thorough manual testing! Can we also test with Windows nodes?


for {
select {
case <-stopCh:

What's the purpose of stopCh that isn't already covered by timeout?

@Lu-David Lu-David Jan 10, 2025

oh actually it makes sense. forgot it's being closed by post drain task


### Important Notes

- A lifecycle hook for instance termination is required for this feature. Longer grace periods are achieved by renewing the heartbeat timeout of the ASG's lifecycle hook. Instances terminate instantly without a hook.

I think this section is important enough that it could be described in the How to use section? The How to use section currently seems sparse.

ASG_TERMINATE_EVENT_ONE_LINE=$(echo "${ASG_TERMINATE_EVENT}" | tr -d '\n' |sed 's/\"/\\"/g')
SEND_SQS_CMD="awslocal sqs send-message --queue-url ${queue_url} --message-body \"${ASG_TERMINATE_EVENT_ONE_LINE}\" --region ${AWS_REGION}"
kubectl exec -i "${localstack_pod}" -- bash -c "$SEND_SQS_CMD"
echo "✅ Sent Spot Interruption Event to SQS queue: ${queue_url}"

nit: ASG termination event

if [[ $FOUND_HEARTBEAT_END_LOG -eq 0 ]] && kubectl logs -n kube-system "${NTH_POD}" | grep -q "Heartbeat deadline exceeded, stopping heartbeat"; then
FOUND_HEARTBEAT_END_LOG=1
fi
if [[ $HEARTBEAT_COUNT -eq 3 && $FOUND_HEARTBEAT_END_LOG -eq 1 ]]; then

nit: Instead of hardcoding these values, can we abstract the values of HEARTBEAT_INTERVAL and HEARTBEAT_UNTIL to figure out expected number of Heartbeats at the top of the file? Just to make it easier to update this test file if need be


if [[ $cordoned -eq 1 && $(kubectl get deployments regular-pod-test -o=jsonpath='{.status.unavailableReplicas}') -eq 1 ]]; then
echo "✅ Verified the regular-pod-test pod was evicted!"
echo "✅ ASG Lifecycle SQS Test Passed $CLUSTER_NAME! ✅"

nit: "ASG Lifecycle SQS Test Passed with Heartbeat"

Just to avoid ambiguity with other tests that we run?

heartbeatTimeout := int(*lifecyclehook.LifecycleHooks[0].HeartbeatTimeout)

if heartbeatInterval >= heartbeatTimeout {
log.Warn().Msgf("Heartbeat interval (%d seconds) is equal to or greater than the heartbeat timeout (%d seconds) for the lifecycle hook %s. The node would likely be terminated before the heartbeat is sent", heartbeatInterval, heartbeatTimeout, *lifecyclehook.LifecycleHooks[0].LifecycleHookName)

can we also add the ASG name (lifecycleDetail.AutoScalingGroupName) in this log warn? Just to help with debugging?
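
A possible shape for this, sketched in Go with only the standard library: the check is pulled into a helper that builds the warning text and includes the ASG name. The helper name `heartbeatIntervalWarning` and the wording "in ASG %s" are assumptions for illustration; the PR's code logs through its own logger, and the final message may read differently.

```go
package main

import "fmt"

// heartbeatIntervalWarning returns a warning message and true when the
// heartbeat interval is equal to or greater than the lifecycle hook's
// heartbeat timeout. Otherwise it returns an empty string and false.
// The ASG name is included to make the log line easier to trace.
func heartbeatIntervalWarning(intervalSec, timeoutSec int, hookName, asgName string) (string, bool) {
	if intervalSec < timeoutSec {
		return "", false
	}
	msg := fmt.Sprintf(
		"Heartbeat interval (%d seconds) is equal to or greater than the heartbeat timeout (%d seconds) "+
			"for the lifecycle hook %s in ASG %s. The node would likely be terminated before the heartbeat is sent",
		intervalSec, timeoutSec, hookName, asgName,
	)
	return msg, true
}

func main() {
	// Example values only; the hook and ASG names are made up.
	if msg, warn := heartbeatIntervalWarning(300, 150, "example-hook", "example-asg"); warn {
		fmt.Println(msg)
	}
}
```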

2 participants