Spark Operator metrics not showing up #2388

Open
1 task done
ishaan-mehta opened this issue Jan 18, 2025 · 0 comments
Labels
kind/bug Something isn't working

Comments

@ishaan-mehta

What happened?

  • ✋ I have searched the open/closed issues and my issue is not listed.

I am trying to scrape the Spark Operator metrics from the metrics endpoint, but when I visit :8080/metrics I do not see any of the metrics listed on this page: https://www.kubeflow.org/docs/components/spark-operator/getting-started/#enable-metric-exporting-to-prometheus

My controller logs look like this (note that metrics are enabled and that there are no messages indicating that the metrics failed to be registered):

++ id -u
+ uid=185
++ id -g
+ gid=185
+ set +e
++ getent passwd 185
+ uidentry=spark:x:185:185::/home/spark:/bin/sh
+ set -e
+ [[ -z spark:x:185:185::/home/spark:/bin/sh ]]
+ exec /usr/bin/tini -s -- /usr/bin/spark-operator controller start --zap-log-level=info '--namespaces=""' --controller-threads=10 --enable-ui-service=true --enable-metrics=true --metrics-bind-address=:8080 --metrics-endpoint=/metrics --metrics-prefix= --metrics-labels=app_type --leader-election=true --leader-election-lock-name=spark-operator-dev-controller-lock --leader-election-lock-namespace=spark-operator --workqueue-ratelimiter-bucket-qps=50 --workqueue-ratelimiter-bucket-size=500 --workqueue-ratelimiter-max-delay=6h
Spark Operator Version: 2.0.2+HEAD+unknown
Build Date: 2024-10-11T01:46:23+00:00
Git Commit ID:
Git Tree State: clean
Go Version: go1.23.1
Compiler: gc
Platform: linux/amd64
2025-01-17T21:48:56.434Z	INFO	controller/start.go:298	Starting manager
2025-01-17T21:48:56.434Z	INFO	controller-runtime.metrics	server/server.go:205	Starting metrics server
2025-01-17T21:48:56.434Z	INFO	manager/server.go:50	starting server	{"kind": "health probe", "addr": "[::]:8081"}
2025-01-17T21:48:56.434Z	INFO	controller-runtime.metrics	server/server.go:244	Serving metrics server	{"bindAddress": ":8080", "secure": false}
I0117 21:48:56.434810      10 leaderelection.go:250] attempting to acquire leader lease spark-operator/spark-operator-dev-controller-lock...
I0117 21:49:16.255021      10 leaderelection.go:260] successfully acquired lease spark-operator/spark-operator-dev-controller-lock
2025-01-17T21:49:16.255Z	INFO	controller/controller.go:178	Starting EventSource	{"controller": "spark-application-controller", "source": "kind source: *v1.Pod"}
2025-01-17T21:49:16.255Z	INFO	controller/controller.go:178	Starting EventSource	{"controller": "spark-application-controller", "source": "kind source: *v1beta2.SparkApplication"}
2025-01-17T21:49:16.255Z	INFO	controller/controller.go:186	Starting Controller	{"controller": "spark-application-controller"}
2025-01-17T21:49:16.255Z	INFO	controller/controller.go:178	Starting EventSource	{"controller": "scheduled-spark-application-controller", "source": "kind source: *v1beta2.ScheduledSparkApplication"}
2025-01-17T21:49:16.255Z	INFO	controller/controller.go:186	Starting Controller	{"controller": "scheduled-spark-application-controller"}
2025-01-17T21:49:16.356Z	INFO	controller/controller.go:220	Starting workers	{"controller": "spark-application-controller", "worker count": 10}
2025-01-17T21:49:16.356Z	INFO	controller/controller.go:220	Starting workers	{"controller": "scheduled-spark-application-controller", "worker count": 10}

Anyone have any idea why these metrics might be missing? I can see other metrics like controller_runtime_active_workers and workqueue_adds_total.
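
For concreteness, this is the kind of check involved (a minimal sketch, assuming the port-forward described under Reproduction Code below; the grep patterns are only illustrative):

curl -s localhost:8080/metrics | grep -E 'controller_runtime_active_workers|workqueue_adds_total'   # these show up
curl -s localhost:8080/metrics | grep -ci 'spark'   # prints 0; none of the documented operator metrics appear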

Reproduction Code

Set up the Spark Operator, port-forward the metrics port to a local machine, and visit localhost:8080/metrics to see all exposed Prometheus metrics (sketched below).
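
A minimal sketch of those steps, assuming the operator runs in the spark-operator namespace (as the leader-election flags in the log suggest) and that the controller deployment is named spark-operator-dev-controller; both names are placeholders to adapt:

kubectl -n spark-operator port-forward deploy/spark-operator-dev-controller 8080:8080
curl -s localhost:8080/metrics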

Expected behavior

Should be able to see the metrics documented on the aforementioned page.

Actual behavior

Cannot see those metrics.

Environment & Versions

  • Kubernetes Version: 1.29.7
  • Spark Operator Version: 2.0.2
  • Apache Spark Version: N/A

Additional context

No response

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.

ishaan-mehta added the kind/bug label on Jan 18, 2025