Fix potential unbounded LLO transmit queue #16166
Conversation
Force-pushed from 0f4bb4e to 80ca5cf
AER Report: CI Core ran successfully ✅
AER Report: Operator UI CI ran successfully ✅
Force-pushed from ed8eb21 to 4ddee6c
We have observed unbounded queue growth in production. The cause is not yet clear. This PR implements a swathe of measures intended to make queue management more reliable and safe.
- Reduces the required number of DB connections and reduces overall DB transaction load
- Removes duplicate deletion code from server.go and manages everything in the persistence manager
- Introduces an application-wide global reaper as a last-ditch cleanup effort
- Implements delete batching for more reliable and incremental deletion (see the sketch below)
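A minimal sketch of what delete batching at the persistence layer can look like, assuming a Postgres-style database and an illustrative transmit_requests table; the table, columns, and function name are examples, not taken from this PR:

```go
package persistence

import (
	"context"
	"database/sql"
	"time"
)

// reapBatch deletes up to batchSize stale rows per call and reports how many
// were removed, so callers can clean up incrementally instead of holding one
// giant DELETE transaction open. The $1/$2 placeholders assume a Postgres driver.
func reapBatch(ctx context.Context, db *sql.DB, cutoff time.Time, batchSize int) (int64, error) {
	res, err := db.ExecContext(ctx, `
		DELETE FROM transmit_requests
		WHERE id IN (
			SELECT id FROM transmit_requests
			WHERE created_at < $1
			ORDER BY created_at
			LIMIT $2
		)`, cutoff, batchSize)
	if err != nil {
		return 0, err
	}
	return res.RowsAffected()
}
```

A caller would invoke this in a loop until a batch comes back smaller than batchSize, keeping each transaction short.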
Force-pushed from 4ddee6c to 5376160
Flakeguard Summary: ran new or updated tests and found flaky tests ❌. For detailed logs of the failed tests, refer to the artifact failed-test-results-with-logs.json.
Force-pushed from 5376160 to 9bec7e0
} else {
	clients := make(map[string]grpc.Client)
	for _, server := range lloCfg.GetServers() {
		var client grpc.Client
		switch r.mercuryCfg.Transmitter().Protocol() {
		case config.MercuryTransmitterProtocolGRPC:
			client = grpc.NewClient(grpc.ClientOpts{
-				Logger: r.lggr,
+				Logger: lggr.Named(server.URL),
What sort of URL? Will it read OK given that logger names are dot-separated?
Looks like e.g. example.com/ws or 192.0.2.1:2345/foo/bar.
The only differentiating factor between servers is the URL, and the logger name needs to be globally unique, so this was the only way I could think of to do it.
So the logger name needs to be unique in order to report health independently, but for the purposes of just logging there is no technical conflict with sharing the logger name - what if we added the URL as a key/val instead? Would that be clear enough?
Suggested change:
-				Logger: lggr.Named(server.URL),
+				Logger: lggr.With("url", server.URL),
Suggested change:
-				Logger: lggr.Named(server.URL),
+				Logger: lggr.Named(fmt.Sprintf("%q", server.URL)),
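For reference, a minimal comparison of the two approaches, sketched with go.uber.org/zap directly; the project's own logger wrapper may differ, and the URL value is just the example from this thread:

```go
package main

import "go.uber.org/zap"

func main() {
	base, _ := zap.NewDevelopment()
	defer base.Sync()
	lggr := base.Sugar()

	serverURL := "example.com/ws" // illustrative value from the review thread

	// Named appends to the dot-separated logger name, so a URL with dots and
	// slashes ends up embedded in the name itself.
	byName := lggr.Named(serverURL)
	byName.Infow("transmitting")

	// With keeps the logger name untouched and attaches the URL as a
	// structured key/val pair on every record instead.
	byKeyVal := lggr.With("url", serverURL)
	byKeyVal.Infow("transmitting")
}
```

Named bakes the URL into the dot-joined logger name, which is what raised the readability question; With leaves the name alone and reports the URL as a field on each log line.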
Will do this in a follow-up.
Bugfixes and improvements intended to make queue management more reliable and safe.
- Reduces the required number of DB connections and reduces overall DB transaction load
- Fixes potential for unbounded queue growth
- Fixes possibility of OOM when trying to load too many records on boot
- Removes duplicate deletion code from server.go and manages everything in the persistence manager
- Introduces an application-wide global reaper as a last-ditch cleanup effort (a rough sketch follows below)
- Implements delete batching for more reliable and incremental deletion
- Ensures that records are properly removed on exit
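A rough sketch of the shape such an application-wide reaper loop can take; the type, the reapStale hook, and the interval handling are illustrative assumptions, not this PR's actual implementation:

```go
package reaper

import (
	"context"
	"time"
)

// reaper is a stand-in for whichever component owns last-ditch cleanup.
type reaper struct {
	reapStale func(context.Context) // hypothetical cleanup hook
}

// run triggers a cleanup pass on every tick and does one final best-effort
// pass on shutdown, so stale records stay bounded while running and are also
// removed on exit.
func (r *reaper) run(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			r.reapStale(context.Background()) // best-effort final pass on exit
			return
		case <-ticker.C:
			r.reapStale(ctx)
		}
	}
}
```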
Force-pushed from d73a39c to 1222cbb
		continue
	}
} else {
	t.lggr.Debugw("Reaped stale transmissions", "nDeleted", n)
Missing break? If we get here, we're just spinning in the for loop and spamming logs like crazy :)
Also the Reaper gets enabled by default on unrelated DONs.
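For illustration, one shape the fix for the missing break could take, assuming a hypothetical deleteBatch helper that reports how many rows it removed; the names here are not from this PR:

```go
package reaper

import (
	"context"

	"go.uber.org/zap"
)

// drainStale keeps deleting in batches until a batch comes back short, which
// means the table is caught up; returning at that point plays the role of the
// missing break, and without it the loop spins and spams the log.
func drainStale(
	ctx context.Context,
	lggr *zap.SugaredLogger,
	deleteBatch func(context.Context, int64) (int64, error), // hypothetical helper
	batchSize int64,
) error {
	for {
		n, err := deleteBatch(ctx, batchSize)
		if err != nil {
			return err
		}
		if n > 0 {
			lggr.Debugw("Reaped stale transmissions", "nDeleted", n)
		}
		if n < batchSize {
			return nil // caught up; stop instead of spinning
		}
	}
}
```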
We have observed unbounded queue growth in production. This PR implements a swathe of bugfixes and improvements intended to make queue management more reliable and safe.