wrapper around jemalloc to track allocator usage by thread #4336

Open: wants to merge 3 commits into master from memory_metrics

alexpyattaev

A simple wrapper around jemalloc that tracks allocator usage by thread name and reports it in metrics.
The goal is to get a better idea of why a node crashes when an OOM occurs (or at least which threads were allocating memory).
This is for dev use only.

Problem

If/when agave starts leaking memory (or just clogging up some channel), it can be tricky to find where the memory allocations that cause the crash are happening. Tracking per-pool allocations is not a replacement for valgrind, but it has the advantage of fairly small overhead and integration into metrics.

Summary of Changes

Added a custom wrapper around jemalloc, gated behind a feature flag, that tracks memory usage grouped by thread pool name.
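To make the idea concrete, here is a minimal sketch of an allocator wrapper that attributes bytes to a per-thread group. It is not the code from this PR: `System` stands in for jemalloc so the example has no external dependencies, and `GROUPS`, `register_current_thread`, and `TrackingAlloc` are hypothetical names. It also assumes each thread resolves its group once at startup rather than on every allocation.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical group table. Index 0 collects threads that match no prefix.
pub const GROUPS: [&str; 3] = ["other", "solGossip", "solRepair"];

thread_local! {
    // Const-initialized and without a destructor, so reading it never allocates.
    static GROUP: Cell<usize> = const { Cell::new(0) };
}

/// Called once by each thread after it starts; resolves its name to a group.
pub fn register_current_thread(name: &str) {
    let idx = GROUPS
        .iter()
        .skip(1)
        .position(|prefix| name.starts_with(*prefix))
        .map_or(0, |i| i + 1);
    GROUP.with(|g| g.set(idx));
}

fn current_group() -> usize {
    // try_with avoids a panic if the allocator runs during thread teardown.
    GROUP.try_with(|g| g.get()).unwrap_or(0)
}

/// Wraps any allocator and counts bytes per group (assumption: counters only).
pub struct TrackingAlloc<A> {
    inner: A,
    allocated: [AtomicUsize; GROUPS.len()],
    freed: [AtomicUsize; GROUPS.len()],
}

/// `System` is a stand-in; a jemalloc allocator type would go here instead.
pub type TrackedSystem = TrackingAlloc<System>;

impl<A> TrackingAlloc<A> {
    pub const fn new(inner: A) -> Self {
        Self {
            inner,
            allocated: [AtomicUsize::new(0), AtomicUsize::new(0), AtomicUsize::new(0)],
            freed: [AtomicUsize::new(0), AtomicUsize::new(0), AtomicUsize::new(0)],
        }
    }

    pub fn allocated(&self, group: usize) -> usize {
        self.allocated[group].load(Ordering::Relaxed)
    }

    pub fn freed(&self, group: usize) -> usize {
        self.freed[group].load(Ordering::Relaxed)
    }
}

unsafe impl<A: GlobalAlloc> GlobalAlloc for TrackingAlloc<A> {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { self.inner.alloc(layout) };
        if !ptr.is_null() {
            self.allocated[current_group()].fetch_add(layout.size(), Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { self.inner.dealloc(ptr, layout) };
        // Charged to the freeing thread, which may differ from the allocating one.
        self.freed[current_group()].fetch_add(layout.size(), Ordering::Relaxed);
    }
}

// To install it process-wide, one would add:
// #[global_allocator]
// static GLOBAL: TrackedSystem = TrackingAlloc::new(System);
```

Because memory can be freed by a different thread than the one that allocated it, the sketch keeps `allocated` and `freed` as separate counters per group instead of a single net figure.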

@alexpyattaev alexpyattaev force-pushed the memory_metrics branch 3 times, most recently from 0ec36fd to 5a8f13a Compare January 7, 2025 22:57
@alexpyattaev alexpyattaev force-pushed the memory_metrics branch 3 times, most recently from b2ac2f7 to 6df862a Compare January 8, 2025 22:59

@gregcusack gregcusack left a comment

this seems like it's going to be a big help! just a few comments/questions. Thank you!

"solQuicTpu",
"solQuicTpuFwd",
"solRepairQuic",
"solGossipQuic",

solGossipQuic does not exist. gossip is fully on udp these days

Author

It exists, it just does not do anything yet.

where does it exist? can you point me to it in the code?

Author

OK, my bad, I had my records wrong. I'll go remove it.

memory-management/src/jemalloc_monitor.rs
Comment on lines 57 to 88
for thread in [
    "solPohTickProd",
    "solSigVerTpuVote",
    "solRcvrGossip",
    "solSigVerTpu",
    "solClusterInfo",
    "solGossipCons",
    "solGossipWork",
    "solGossip",
    "solRepair",
    "FetchStage",
    "solShredFetch",
    "solReplayTx",
    "solReplayFork",
    "solRayonGlob",
    "solSvrfyShred",
    "solSigVerify",
    "solRetransmit",
    "solRunGossip",
    "solWinInsert",
    "solAccountsLo",
    "solAccounts",
    "solAcctHash",
    "solVoteSigVerTpu",
    "solTrSigVerTpu",
    "solQuicClientRt",
    "solQuicTVo",
    "solQuicTpu",
    "solQuicTpuFwd",
    "solRepairQuic",
    "solGossipQuic",
    "solTurbineQuic",
] {

we'd just need to make sure that these are updated as thread names get added or removed. that may be a challenge

Author

You bet! But once the thread manager gets merged, this can be automated very easily. Feel free to chime in on that effort in #3890.

sounds good. just note that solGossipQuic isn't a named thread in agave.

memory-management/src/jemalloc_monitor.rs
Comment on lines +205 to +211
if !name.starts_with(prefix) {
    continue;
}
return Some(stats);

this is a first-match prefix check, so threads like solGossip will also match to solGossipConsume. I don't think that is the ideal behavior.

Author

Nope. Not ideal. But running regexp would be too slow.

ya or could do a longest match prefix. OR if it really doesn't matter then I would add a comment that this is the behavior. we'll just get unpredictable results with a shortest match

Author

I admit this solution is far from perfect. But I do not have a better idea with similar perf.

ok so the max prefix length is 16 characters. would that be long enough to not have many issues? since solGossip won't "start with" solGossipConsume. so it may not be that big of an issue actually.

Author

Well in current Agave it is plenty long. The prefix notation is a bit brittle, but we can actually address this easily by sorting prefixes by length before adding them into the filter. I'll probably implement that at some point just to be safe.
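
For reference, the sorting idea can be sketched as follows. This is an illustration only, not code from the PR: `sort_longest_first` and `match_prefix` are hypothetical helpers, and the prefix list is a made-up subset. Sorting prefixes by descending length means a plain first-match scan returns the longest matching prefix, with no extra cost at lookup time.

```rust
/// Orders prefixes longest first (ties broken alphabetically for determinism),
/// so that a first-match scan behaves like a longest-prefix match.
pub fn sort_longest_first(prefixes: &mut [&str]) {
    prefixes.sort_by(|a, b| b.len().cmp(&a.len()).then_with(|| a.cmp(b)));
}

/// Returns the first prefix in `sorted` that `thread_name` starts with.
pub fn match_prefix<'a>(sorted: &[&'a str], thread_name: &str) -> Option<&'a str> {
    sorted
        .iter()
        .copied()
        .find(|prefix| thread_name.starts_with(*prefix))
}
```

With the hypothetical prefixes `solGossip` and `solGossipCons`, a thread named `solGossipCons` now resolves to `solGossipCons` regardless of the order in which the prefixes were registered, while `solGossip01` still falls back to `solGossip`.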

@alexpyattaev alexpyattaev marked this pull request as ready for review January 17, 2025 17:06