-
Notifications
You must be signed in to change notification settings - Fork 291
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
wrapper around jemalloc to track allocator usage by thread #4336
base: master
Are you sure you want to change the base?
Conversation
0ec36fd
to
5a8f13a
Compare
b2ac2f7
to
6df862a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems like it's going to be a big help! just a few comments/questions. Thank you!
"solQuicTpu", | ||
"solQuicTpuFwd", | ||
"solRepairQuic", | ||
"solGossipQuic", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
solGossipQuic
does not exist. gossip is fully on udp these days
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It exists just does not do anything. yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where does it exist? can you point me to it in the code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok my bad I have my records wrong. I'll go remove it.
for thread in [ | ||
"solPohTickProd", | ||
"solSigVerTpuVote", | ||
"solRcvrGossip", | ||
"solSigVerTpu", | ||
"solClusterInfo", | ||
"solGossipCons", | ||
"solGossipWork", | ||
"solGossip", | ||
"solRepair", | ||
"FetchStage", | ||
"solShredFetch", | ||
"solReplayTx", | ||
"solReplayFork", | ||
"solRayonGlob", | ||
"solSvrfyShred", | ||
"solSigVerify", | ||
"solRetransmit", | ||
"solRunGossip", | ||
"solWinInsert", | ||
"solAccountsLo", | ||
"solAccounts", | ||
"solAcctHash", | ||
"solVoteSigVerTpu", | ||
"solTrSigVerTpu", | ||
"solQuicClientRt", | ||
"solQuicTVo", | ||
"solQuicTpu", | ||
"solQuicTpuFwd", | ||
"solRepairQuic", | ||
"solGossipQuic", | ||
"solTurbineQuic", | ||
] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we'd just need to make sure that these are updated as new thread names get added slash removed. that may be a challenge
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You bet! But once thread manager gets merged this can be automated very easily. Feel free to chime in on that effort #3890
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good. just note that solGossipQuic
isn't a named thread in agave.
if !name.starts_with(prefix) { | ||
continue; | ||
} | ||
return Some(stats); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a first match prefix, so threads like solGossip
will also match to solGossipConsume
. don't think that is the ideal behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope. Not ideal. But running regexp would be too slow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya or could do a longest match prefix. OR if it really doesn't matter then I would add a comment that this is the behavior. we'll just get unpredictable results with a shortest match
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I admit this solution is far from perfect. But I do not have a better idea with similar perf.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok so the max prefix length is 16 characters. would that be long enough to not have many issues? since solGossip
won't "start with" solGossipConsume
. so it may not be that big of an issue actually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well in current Agave it is plenty long. The prefix notation is a bit brittle, but we can actually address this easily by sorting prefixes by length before adding them into the filter. I'll probably implement that at some point just to be safe.
0e9c9ab
to
73161e0
Compare
73161e0
to
e1e6c16
Compare
a simple wrapper around jemalloc to track allocator usage by thread name in metrics.
Idea is to get a better idea why node crashes when OOM occurs (at least which threads were allocating memory).
This is for dev use only.
Problem
If/when agave starts leaking memory (or just clogging up some channel) it may be tricky to find where memory allocations are happening that cause the crash. Tracking per-pool allocations is not a replacement for valgrind, but has the advantage of fairly small overhead & integration into metrics.
Summary of Changes
Added feature-flag gated custom wrapper around jemalloc that tracks memory usage, grouped by thread pool name.