Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix potential unbounded LLO transmit queue #16166

Merged
merged 2 commits into from
Feb 5, 2025
Merged

Conversation

samsondav
Copy link
Collaborator

@samsondav samsondav commented Jan 31, 2025

We have observed unbounded queue growth in production. This PR implements a swathe of bugfixes and improvements intended to make queue management more reliable and safe.

  • Reduces required number of DB connections and reduce overall DB transaction load
  • Remove duplicate deletion code from server.go and manage everything in the persistence manager
  • Introduce an application-wide global reaper for last-ditch cleanup effort
  • Implement delete batching for more reliable and incremental deletion

@samsondav samsondav requested review from a team as code owners January 31, 2025 12:48
ro-tex
ro-tex previously approved these changes Jan 31, 2025
@samsondav samsondav enabled auto-merge January 31, 2025 13:38
@samsondav samsondav changed the title Initial set to populate gauge metrics Optimize LLO transmitter queue deletes using batching Jan 31, 2025
@samsondav samsondav force-pushed the set_initial_metrics branch 7 times, most recently from 0f4bb4e to 80ca5cf Compare January 31, 2025 16:24
Copy link
Contributor

github-actions bot commented Jan 31, 2025

AER Report: CI Core ran successfully ✅

aer_workflow , commit

AER Report: Operator UI CI ran successfully ✅

aer_workflow , commit

@samsondav samsondav requested review from a team as code owners February 3, 2025 13:57
@samsondav samsondav requested a review from krehermann February 3, 2025 13:57
@samsondav samsondav force-pushed the set_initial_metrics branch from ed8eb21 to 4ddee6c Compare February 3, 2025 13:58
@samsondav samsondav changed the title Optimize LLO transmitter queue deletes using batching Refactor LLO transmission queue cleanup Feb 3, 2025
samsondav added a commit that referenced this pull request Feb 3, 2025
We have observed unbounded queue growth in production. Not sure of the cause. This PR implements a swathe of measures intended to make queue management more reliable and safe.

- Reduces required number of DB connections and reduce overall DB transaction load
- Remove duplicate deletion code from server.go and manage everything in the persistence manager
- Introduce an application-wide global reaper for last-ditch cleanup effort
- Implement delete batching for more reliable and incremental deletion
@samsondav samsondav force-pushed the set_initial_metrics branch from 4ddee6c to 5376160 Compare February 3, 2025 14:01
Copy link
Contributor

github-actions bot commented Feb 3, 2025

Flakeguard Summary

Ran new or updated tests between develop and 5376160 (set_initial_metrics).

View Flaky Detector Details | Compare Changes

Found Flaky Tests ❌

Name Pass Ratio Panicked? Timed Out? Race? Runs Successes Failures Skips Package Package Panicked? Avg Duration Code Owners
TestConfig_Marshal 0.00% false false false 3 0 3 0 github.com/smartcontractkit/chainlink/v2/core/services/chainlink false 40ms Unknown
TestConfig_Marshal/Mercury 0.00% false false false 3 0 3 0 github.com/smartcontractkit/chainlink/v2/core/services/chainlink false 0s Unknown
TestConfig_Marshal/full 0.00% false false false 3 0 3 0 github.com/smartcontractkit/chainlink/v2/core/services/chainlink false 10ms Unknown
TestConfig_full 0.00% false false false 3 0 3 0 github.com/smartcontractkit/chainlink/v2/core/services/chainlink false 10ms Unknown
Test_generalConfig_LogConfiguration 0.00% false false false 3 0 3 0 github.com/smartcontractkit/chainlink/v2/core/services/chainlink false 20ms Unknown
Test_generalConfig_LogConfiguration/empty 0.00% false false false 3 0 3 0 github.com/smartcontractkit/chainlink/v2/core/services/chainlink false 0s Unknown
Test_generalConfig_LogConfiguration/full 0.00% false false false 3 0 3 0 github.com/smartcontractkit/chainlink/v2/core/services/chainlink false 10ms Unknown
Test_generalConfig_LogConfiguration/multi-chain 0.00% false false false 3 0 3 0 github.com/smartcontractkit/chainlink/v2/core/services/chainlink false 10ms Unknown

Artifacts

For detailed logs of the failed tests, please refer to the artifact failed-test-results-with-logs.json.

samsondav added a commit that referenced this pull request Feb 3, 2025
We have observed unbounded queue growth in production. Not sure of the cause. This PR implements a swathe of measures intended to make queue management more reliable and safe.

- Reduces required number of DB connections and reduce overall DB transaction load
- Remove duplicate deletion code from server.go and manage everything in the persistence manager
- Introduce an application-wide global reaper for last-ditch cleanup effort
- Implement delete batching for more reliable and incremental deletion
@samsondav samsondav force-pushed the set_initial_metrics branch from 5376160 to 9bec7e0 Compare February 3, 2025 14:20
Copy link
Contributor

github-actions bot commented Feb 3, 2025

Flakeguard Summary

Ran new or updated tests between develop and 9bec7e0 (set_initial_metrics).

View Flaky Detector Details | Compare Changes

Found Flaky Tests ❌

Name Pass Ratio Panicked? Timed Out? Race? Runs Successes Failures Skips Package Package Panicked? Avg Duration Code Owners
TestConfig_Marshal 0.00% false false false 3 0 3 0 github.com/smartcontractkit/chainlink/v2/core/services/chainlink false 40ms Unknown
TestConfig_Marshal/Mercury 0.00% false false false 3 0 3 0 github.com/smartcontractkit/chainlink/v2/core/services/chainlink false 0s Unknown
TestConfig_Marshal/full 0.00% false false false 3 0 3 0 github.com/smartcontractkit/chainlink/v2/core/services/chainlink false 10ms Unknown
TestConfig_full 0.00% false false false 3 0 3 0 github.com/smartcontractkit/chainlink/v2/core/services/chainlink false 10ms Unknown
Test_generalConfig_LogConfiguration 0.00% false false false 3 0 3 0 github.com/smartcontractkit/chainlink/v2/core/services/chainlink false 23.333333ms Unknown
Test_generalConfig_LogConfiguration/empty 0.00% false false false 3 0 3 0 github.com/smartcontractkit/chainlink/v2/core/services/chainlink false 0s Unknown
Test_generalConfig_LogConfiguration/full 0.00% false false false 3 0 3 0 github.com/smartcontractkit/chainlink/v2/core/services/chainlink false 10ms Unknown
Test_generalConfig_LogConfiguration/multi-chain 0.00% false false false 3 0 3 0 github.com/smartcontractkit/chainlink/v2/core/services/chainlink false 10ms Unknown

Artifacts

For detailed logs of the failed tests, please refer to the artifact failed-test-results-with-logs.json.

samsondav added a commit that referenced this pull request Feb 3, 2025
We have observed unbounded queue growth in production. Not sure of the cause. This PR implements a swathe of measures intended to make queue management more reliable and safe.

- Reduces required number of DB connections and reduce overall DB transaction load
- Remove duplicate deletion code from server.go and manage everything in the persistence manager
- Introduce an application-wide global reaper for last-ditch cleanup effort
- Implement delete batching for more reliable and incremental deletion
@samsondav samsondav changed the title Refactor LLO transmission queue cleanup Fix potential unbounded LLO transmit queue Feb 5, 2025
msuchacz-cll
msuchacz-cll previously approved these changes Feb 5, 2025
} else {
clients := make(map[string]grpc.Client)
for _, server := range lloCfg.GetServers() {
var client grpc.Client
switch r.mercuryCfg.Transmitter().Protocol() {
case config.MercuryTransmitterProtocolGRPC:
client = grpc.NewClient(grpc.ClientOpts{
Logger: r.lggr,
Logger: lggr.Named(server.URL),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What sort of url? Will it read OK given the names are dot separated?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like e.g. example.com/ws or 192.0.2.1:2345/foo/bar

The only differentiating factor between servers is the URL, and the logger name needs to be globally unique, so this was the only way I could think of to do it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the logger name needs to be unique in order to report health independently, but for the purposes of just logging there is no technical conflict with sharing the log - what if we added it as a key/val? Would that be clear enough?

Suggested change
Logger: lggr.Named(server.URL),
Logger: lggr.With("url", server.URL),

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Logger: lggr.Named(server.URL),
Logger: lggr.Named(fmt.Sprintf("%q", server.URL)),

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do this in a followup

jmank88
jmank88 previously approved these changes Feb 5, 2025
@samsondav samsondav added this pull request to the merge queue Feb 5, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 5, 2025
@samsondav samsondav added this pull request to the merge queue Feb 5, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 5, 2025
@samsondav samsondav added this pull request to the merge queue Feb 5, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 5, 2025
@samsondav samsondav added this pull request to the merge queue Feb 5, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 5, 2025
@samsondav samsondav added this pull request to the merge queue Feb 5, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 5, 2025
Bugfixes and improvements intended to make queue management more reliable and safe.

- Reduces required number of DB connections and reduce overall DB transaction load
- Fixes potential for unbounded queue growth
- Fixes possibility of OOM trying to load too many records on boot
- Remove duplicate deletion code from server.go and manage everything in the persistence manager
- Introduce an application-wide global reaper for last-ditch cleanup effort
- Implement delete batching for more reliable and incremental deletion
- Ensure that records are properly removed on exit
@samsondav samsondav dismissed stale reviews from jmank88 and msuchacz-cll via 1222cbb February 5, 2025 16:02
@samsondav samsondav force-pushed the set_initial_metrics branch from d73a39c to 1222cbb Compare February 5, 2025 16:02
@samsondav samsondav enabled auto-merge February 5, 2025 16:08
@samsondav samsondav added this pull request to the merge queue Feb 5, 2025
Merged via the queue into develop with commit 3a7cfeb Feb 5, 2025
187 of 188 checks passed
@samsondav samsondav deleted the set_initial_metrics branch February 5, 2025 17:37
continue
}
} else {
t.lggr.Debugw("Reaped stale transmissions", "nDeleted", n)
Copy link
Contributor

@bolekk bolekk Feb 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing break? If we get here, we're just spinning in the for loop and spamming logs like crazy :)

Also the Reaper gets enabled by default on unrelated DONs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants