Spark Operator metrics not showing up #2388

Open
1 task done
ishaan-mehta opened this issue Jan 18, 2025 · 0 comments
Labels
kind/bug Something isn't working

Comments

@ishaan-mehta

What happened?

  • ✋ I have searched the open/closed issues and my issue is not listed.

I am trying to scrape the Spark Operator metrics from the metrics endpoint, but when I visit :8080/metrics I do not see any of the metrics listed on this page: https://www.kubeflow.org/docs/components/spark-operator/getting-started/#enable-metric-exporting-to-prometheus

My controller logs look like this (note that metrics are enabled and that there are no messages indicating that the metrics failed to be registered):

++ id -u
+ uid=185
++ id -g
+ gid=185
+ set +e
++ getent passwd 185
+ uidentry=spark:x:185:185::/home/spark:/bin/sh
+ set -e
+ [[ -z spark:x:185:185::/home/spark:/bin/sh ]]
+ exec /usr/bin/tini -s -- /usr/bin/spark-operator controller start --zap-log-level=info '--namespaces=""' --controller-threads=10 --enable-ui-service=true --enable-metrics=true --metrics-bind-address=:8080 --metrics-endpoint=/metrics --metrics-prefix= --metrics-labels=app_type --leader-election=true --leader-election-lock-name=spark-operator-dev-controller-lock --leader-election-lock-namespace=spark-operator --workqueue-ratelimiter-bucket-qps=50 --workqueue-ratelimiter-bucket-size=500 --workqueue-ratelimiter-max-delay=6h
Spark Operator Version: 2.0.2+HEAD+unknown
Build Date: 2024-10-11T01:46:23+00:00
Git Commit ID:
Git Tree State: clean
Go Version: go1.23.1
Compiler: gc
Platform: linux/amd64
2025-01-17T21:48:56.434Z	INFO	controller/start.go:298	Starting manager
2025-01-17T21:48:56.434Z	INFO	controller-runtime.metrics	server/server.go:205	Starting metrics server
2025-01-17T21:48:56.434Z	INFO	manager/server.go:50	starting server	{"kind": "health probe", "addr": "[::]:8081"}
2025-01-17T21:48:56.434Z	INFO	controller-runtime.metrics	server/server.go:244	Serving metrics server	{"bindAddress": ":8080", "secure": false}
I0117 21:48:56.434810      10 leaderelection.go:250] attempting to acquire leader lease spark-operator/spark-operator-dev-controller-lock...
I0117 21:49:16.255021      10 leaderelection.go:260] successfully acquired lease spark-operator/spark-operator-dev-controller-lock
2025-01-17T21:49:16.255Z	INFO	controller/controller.go:178	Starting EventSource	{"controller": "spark-application-controller", "source": "kind source: *v1.Pod"}
2025-01-17T21:49:16.255Z	INFO	controller/controller.go:178	Starting EventSource	{"controller": "spark-application-controller", "source": "kind source: *v1beta2.SparkApplication"}
2025-01-17T21:49:16.255Z	INFO	controller/controller.go:186	Starting Controller	{"controller": "spark-application-controller"}
2025-01-17T21:49:16.255Z	INFO	controller/controller.go:178	Starting EventSource	{"controller": "scheduled-spark-application-controller", "source": "kind source: *v1beta2.ScheduledSparkApplication"}
2025-01-17T21:49:16.255Z	INFO	controller/controller.go:186	Starting Controller	{"controller": "scheduled-spark-application-controller"}
2025-01-17T21:49:16.356Z	INFO	controller/controller.go:220	Starting workers	{"controller": "spark-application-controller", "worker count": 10}
2025-01-17T21:49:16.356Z	INFO	controller/controller.go:220	Starting workers	{"controller": "scheduled-spark-application-controller", "worker count": 10}

Anyone have any idea why these metrics might be missing? I can see other metrics like controller_runtime_active_workers and workqueue_adds_total.
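
For concreteness, this is the kind of check involved (a minimal sketch, assuming the port-forward described under Reproduction Code below; the grep patterns are only illustrative):

curl -s localhost:8080/metrics | grep -E 'controller_runtime_active_workers|workqueue_adds_total'   # these show up
curl -s localhost:8080/metrics | grep -ci 'spark'   # prints 0; none of the documented operator metrics appear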

Reproduction Code

Set up the Spark Operator, port-forward the metrics port to a local machine, and visit localhost:8080/metrics to see all exposed Prometheus metrics (sketched below).
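
A minimal sketch of those steps, assuming the operator runs in the spark-operator namespace (as the leader-election flags in the log suggest) and that the controller deployment is named spark-operator-dev-controller; both names are placeholders to adapt:

kubectl -n spark-operator port-forward deploy/spark-operator-dev-controller 8080:8080
curl -s localhost:8080/metrics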

Expected behavior

Should be able to see the metrics documented on the aforementioned page.

Actual behavior

Cannot see those metrics.

Environment & Versions

  • Kubernetes Version: 1.29.7
  • Spark Operator Version: 2.0.2
  • Apache Spark Version: N/A

Additional context

No response

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.

ishaan-mehta added the kind/bug label on Jan 18, 2025