Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CELEBORN-1817] add committed file size metrics #3047

Closed
wants to merge 4 commits into from

Conversation

CodingCat
Copy link
Contributor

What changes were proposed in this pull request?

this PR adds the file size metrics for workers

Why are the changes needed?

the reason for us to add this metric is that we observed that, likely due to the delayed processing of split messages, we have jobs writing 40-50g files even the split threshold is 10g (we use soft split)

we want to have this metrics to monitor the severity of the issue

Does this PR introduce any user-facing change?

yes, one more metrics

How was this patch tested?

(ignore the dashboard title, it's a dummy one)

image

@CodingCat CodingCat changed the title add committed file size metrics [CELEBORN-1817] add committed file size metrics Jan 3, 2025
@SteNicholas
Copy link
Member

@CodingCat, please add the metric in monitoring.md and celeborn-dashboard.json.

@CodingCat
Copy link
Contributor Author

@SteNicholas sure, updated both

"sort": "none"
}
},
"pluginVersion": "11.5.0-80207",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can remove it

{
"datasource": {
"type": "prometheus",
"uid": "grafanacloud-prom"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

 "uid": "${DS_PROMETHEUS}"

"expr": "metrics_PartitionFileSizeBytes_Value{role=\"Worker\", instance=~\"${instance}\"}",
"hide": false,
"instant": false,
"legendFormat": "__auto",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"legendFormat": "${baseLegend}",

"h": 8,
"w": 12,
"x": 12,
"y": 62
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

            "y": 110

@@ -84,6 +84,7 @@ class WorkerSource(conf: CelebornConf) extends AbstractSource(conf, Role.WORKER)
addTimer(CLEAN_EXPIRED_SHUFFLE_KEYS_TIME)

addHistogram(FETCH_CHUNK_TRANSFER_SIZE)
addHistogram(PARTITION_FILE_SIZE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a histogram, you can add the mean, max, 99th, and 95th values into the grafana dashboard.

@CodingCat
Copy link
Contributor Author

thank you @turboFei and @FMX ! updated the PR, would you mind taking another look?

Copy link
Contributor

@FMX FMX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks. Merged into main(v0.6.0).

@FMX FMX closed this in ca60613 Jan 7, 2025
@CodingCat CodingCat deleted the committed_file_size branch January 7, 2025 02:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants