Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use HTTP liveness probe instead of tetra status #2474

Closed
wants to merge 2 commits into from

Conversation

tpapagian
Copy link
Member

@tpapagian tpapagian commented May 27, 2024

Please ensure your pull request adheres to the following guidelines:

  • For first time contributors, read Submitting a pull request.
  • All code is covered by unit and/or end-to-end tests tests where feasible.
  • All commits contain a well written commit message including a title,
    description and a Fixes: #XXX line if the commit addresses a particular
    GitHub issue.
  • All commits are signed off. See the section Developer’s Certificate of Origin
  • Provide a title or release-note blurb suitable for the release notes.
  • Are you a user of Tetragon? Please add yourself to the Users doc in the Cilium repository.
  • Thanks for contributing!

Now, we use tetra status command to report the status of tetragon agent. This comes with some overheads as tetra binary has a lot of additional functionality and it seems like an overkill to use that for status reporting.

On the other hand, k8s supports liveness probes by using an HTTP endpoint (i.e. https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#http-probes). This patch first creates a new HTTP endpoint to report agent status and then change helm to make use of that.

Use HTTP liveness probe instead of tetra status.

@tpapagian tpapagian force-pushed the pr/apapag/http-liveness-probe branch from bd2eb34 to 4ae2b3d Compare May 27, 2024 10:09
@tpapagian tpapagian added the release-note/misc This PR makes changes that have no direct user impact. label May 27, 2024
@tpapagian tpapagian force-pushed the pr/apapag/http-liveness-probe branch 2 times, most recently from d6c18d1 to 56363a8 Compare May 27, 2024 10:29
@tpapagian tpapagian changed the title Test Use HTTP liveness probe instead of tetra status May 27, 2024
@tpapagian tpapagian marked this pull request as ready for review May 27, 2024 11:08
@tpapagian tpapagian requested a review from a team as a code owner May 27, 2024 11:08
@tpapagian tpapagian requested a review from jrfastab May 27, 2024 11:08
Now, we use tetra status command to report the status of tetragon
agent. This comes with some overheads as tetra binary has a lot of
additional functionality and it seems like an overkill to use that for
status reporting.

On the other hand, k8s supports liveness probes by using an HTTP
endpoint (i.e.
https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#http-probes).
This patch first creates a new HTTP endpoint to report agent status that
can be used for the liveness probe.

Signed-off-by: Anastasios Papagiannis <[email protected]>
@tpapagian tpapagian force-pushed the pr/apapag/http-liveness-probe branch from 56363a8 to f3c11ed Compare May 27, 2024 11:44
Copy link

netlify bot commented May 27, 2024

Deploy Preview for tetragon ready!

Name Link
🔨 Latest commit d9bd92c
🔍 Latest deploy log https://app.netlify.com/sites/tetragon/deploys/6655a0fd9b19f20009e73930
😎 Deploy Preview https://deploy-preview-2474--tetragon.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

# -- Use http-based livenessProbe for the tetragon container. livenessProbe has a higher priority.
livenessHttpProbe:
enabled: true
# host: "localhost"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would better to have this under livenessProbe somehow.

CC: @michi-covalent @lambdanis

Copy link
Member Author

@tpapagian tpapagian May 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could do something like:

{{- if .Values.tetragon.livenessProbe.enabled }}
  livenessProbe:
    timeoutSeconds: 60
{{- if .Values.tetragon.livenessProbe.custom }}  
  {{- toYaml .Values.tetragon.livenessProbe.custom | nindent 4 }}
{{- else if .Values.tetragon.livenessProbe.http.enabled }}
     httpGet:
       path: /liveness
       port: 6789
{{- if .Values.tetragon.livenessProbe.http.host }}
       host: {{ .Values.tetragon.livenessProbe.http.host }}
{{- end }}
{{- else if .Values.tetragon.grpc.enabled }}
     exec:
       command:
       - tetra
       - status
       - --server-address
       - {{ .Values.tetragon.grpc.address }}
       - --retries
       - "5"
{{- end -}}
{{- end -}}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using that we can have:


livenessProbe:
  enabled: false

to disable liveness probe.


livenessProbe:
  enabled: true
  custom:
    grpc:
      port: 54321

to define a custom liveness probe.


livenessProbe:
  enabled: true
  http:
    enabled: true

to define an http-based liveness probe.


livenessProbe:
  enabled: true

to define our legacy "tetra status"-based liveness probe (assuming tetragon.grpc.enabled is true).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm missing something, but can we just change the default livenessProbe and keep everything else the same? That is, change the default values to:

livenessProbe:
  timeoutSeconds: 60
  httpGet:
    path: /liveness
    port: 6789

and keep _container_tetragon.tpl unchanged. Then users can still use the tetra probe by setting livenessProbe: {}.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this makes sense (easier compared to what I am trying to do). But what I am not able to achieve is setting livenessProbe: {}. Possibly I am missing something.

First, change the existing values.yaml with:

@@ -64,9 +64,11 @@ tetragon:
   securityContext:
     privileged: true
   # -- Overrides the default livenessProbe for the tetragon container.
-  livenessProbe: {}
-  #  grpc:
-  #    port: 54321
+  livenessProbe:
+    timeoutSeconds: 60
+    httpGet:
+      path: "/liveness"
+      port: 6789

   # Tetragon puts processes in an LRU cache. The cache is used to find ancestors
   # for subsequently exec'ed processes.

Then I run:

$ cat values.yaml
tetragon:
  livenessProbe: {}
$ helm template ./install/kubernetes/tetragon -f ./values.yaml
[...]
        livenessProbe:
          httpGet:
            path: /liveness
            port: 6789
          timeoutSeconds: 60
[...]

Which seems to be the same.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I think I found a way to do that. Updating the PR now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should define

$ cat values.yaml
tetragon:
  livenessProbe: null

instead of

$ cat values.yaml
tetragon:
  livenessProbe: {}

to use the old tetra-based liveness probe.

PR is ready for review.

The previous commit introduced an HTTP endopoint that can be used for
the liveness probe. This patch changes helm to make that default instead
of the tetra status based liveness probe.

The user can still use the tetra status based liveness probe by
defining a values file similar to:
$ cat values.yaml
tetragon:
  livenessProbe: null

Signed-off-by: Anastasios Papagiannis <[email protected]>
@tpapagian tpapagian force-pushed the pr/apapag/http-liveness-probe branch from f3c11ed to d9bd92c Compare May 27, 2024 17:33
@michi-covalent michi-covalent self-requested a review May 27, 2024 17:49
@michi-covalent
Copy link
Contributor

i have my opinion about this. i'll add my review here tomorrow 🚀 🙏

Copy link
Contributor

@lambdanis lambdanis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the gRPC server is disabled then this new healthcheck will always fail, right? I think it shouldn't.

I don't have context on why the GetHealth endpoint was introduced, but it looks a bit awkward to me. gRPC itself defines the health checking service, so I would use the upstream proto and start the health service alongside the regular Tetragon gRPC server.

Also, Kubernetes now supports gRPC liveness probes, so unless we want to support old Kubernetes versions, we can use them instead of starting another HTTP server.

@tpapagian
Copy link
Member Author

If the gRPC server is disabled then this new healthcheck will always fail, right? I think it shouldn't.

Not sure why with gRPC server disabled the new check will fail. How is that related?

I don't have context on why the GetHealth endpoint was introduced, but it looks a bit awkward to me. gRPC itself defines the health checking service, so I would use the upstream proto and start the health service alongside the regular Tetragon gRPC server.

Also, Kubernetes now supports gRPC liveness probes, so unless we want to support old Kubernetes versions, we can use them instead of starting another HTTP server.

I also don't have context about the details of GetHealth function that we use. But I believe that this is not directly related to this PR. The gRPC liveness probe in k8s is stable in 1.27. The purpose of this PR is to remove the need to have a liveness probe that is related to tetra status and use http-based liveness probe that also supports older k8s versions. We can also implement and use grpc-based liveness probe on systems that supports that, but this can be also done in a followup PR.

@lambdanis
Copy link
Contributor

Not sure why with gRPC server disabled the new check will fail. How is that related?

Ok, I actually checked GetHealth implementation - it always says it's fine. So you're right, it won't fail, it's just a confusing implementation. I think it's fine to follow-up with the new health checks, just wanted to check I'm not missing something.

@tpapagian tpapagian force-pushed the pr/apapag/http-liveness-probe branch 4 times, most recently from d47e9c9 to d9bd92c Compare May 28, 2024 09:16
@tpapagian
Copy link
Member Author

Closing in favor of #2478 as we choose to proceed with a gRPC-based liveness probe.

@tpapagian tpapagian closed this May 28, 2024
@tpapagian tpapagian deleted the pr/apapag/http-liveness-probe branch May 28, 2024 20:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants