Fix potential unbounded LLO transmit queue #16166
Conversation
Force-pushed from 0f4bb4e to 80ca5cf
AER Report: CI Core ran successfully ✅
AER Report: Operator UI CI ran successfully ✅
Force-pushed from ed8eb21 to 4ddee6c
We have observed unbounded queue growth in production. The cause is not yet clear. This PR implements a swathe of measures intended to make queue management more reliable and safe.
- Reduces the required number of DB connections and reduces overall DB transaction load
- Removes duplicate deletion code from server.go and manages everything in the persistence manager
- Introduces an application-wide global reaper as a last-ditch cleanup effort
- Implements delete batching for more reliable and incremental deletion (see the sketch below)
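A minimal sketch of what delete batching at the persistence layer can look like, assuming a Postgres-style database and an illustrative transmit_requests table; the table, columns, and function name are examples, not taken from this PR:

```go
package persistence

import (
	"context"
	"database/sql"
	"time"
)

// reapBatch deletes up to batchSize stale rows per call and reports how many
// were removed, so callers can clean up incrementally instead of holding one
// giant DELETE transaction open. The $1/$2 placeholders assume a Postgres driver.
func reapBatch(ctx context.Context, db *sql.DB, cutoff time.Time, batchSize int) (int64, error) {
	res, err := db.ExecContext(ctx, `
		DELETE FROM transmit_requests
		WHERE id IN (
			SELECT id FROM transmit_requests
			WHERE created_at < $1
			ORDER BY created_at
			LIMIT $2
		)`, cutoff, batchSize)
	if err != nil {
		return 0, err
	}
	return res.RowsAffected()
}
```

A caller would invoke this in a loop until a batch comes back smaller than batchSize, keeping each transaction short.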
Force-pushed from 4ddee6c to 5376160
Flakeguard Summary: ran new or updated tests and found flaky tests ❌. For detailed logs of the failed tests, refer to the artifact failed-test-results-with-logs.json.
Force-pushed from 5376160 to 9bec7e0
} else {
	clients := make(map[string]grpc.Client)
	for _, server := range lloCfg.GetServers() {
		var client grpc.Client
		switch r.mercuryCfg.Transmitter().Protocol() {
		case config.MercuryTransmitterProtocolGRPC:
			client = grpc.NewClient(grpc.ClientOpts{
-				Logger: r.lggr,
+				Logger: lggr.Named(server.URL),
What sort of URL? Will it read OK given that logger names are dot-separated?
Looks like e.g. example.com/ws or 192.0.2.1:2345/foo/bar.
The only differentiating factor between servers is the URL, and the logger name needs to be globally unique, so this was the only way I could think of to do it.
So the logger name needs to be unique in order to report health independently, but for the purposes of just logging there is no technical conflict with sharing the logger name - what if we added the URL as a key/val instead? Would that be clear enough?
Suggested change:
-				Logger: lggr.Named(server.URL),
+				Logger: lggr.With("url", server.URL),
Suggested change:
-				Logger: lggr.Named(server.URL),
+				Logger: lggr.Named(fmt.Sprintf("%q", server.URL)),
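For reference, a minimal comparison of the two approaches, sketched with go.uber.org/zap directly; the project's own logger wrapper may differ, and the URL value is just the example from this thread:

```go
package main

import "go.uber.org/zap"

func main() {
	base, _ := zap.NewDevelopment()
	defer base.Sync()
	lggr := base.Sugar()

	serverURL := "example.com/ws" // illustrative value from the review thread

	// Named appends to the dot-separated logger name, so a URL with dots and
	// slashes ends up embedded in the name itself.
	byName := lggr.Named(serverURL)
	byName.Infow("transmitting")

	// With keeps the logger name untouched and attaches the URL as a
	// structured key/val pair on every record instead.
	byKeyVal := lggr.With("url", serverURL)
	byKeyVal.Infow("transmitting")
}
```

Named bakes the URL into the dot-joined logger name, which is what raised the readability question; With leaves the name alone and reports the URL as a field on each log line.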
Will do this in a follow-up.
Bugfixes and improvements intended to make queue management more reliable and safe.
- Reduces the required number of DB connections and reduces overall DB transaction load
- Fixes potential for unbounded queue growth
- Fixes possibility of OOM when trying to load too many records on boot
- Removes duplicate deletion code from server.go and manages everything in the persistence manager
- Introduces an application-wide global reaper as a last-ditch cleanup effort (a rough sketch follows below)
- Implements delete batching for more reliable and incremental deletion
- Ensures that records are properly removed on exit
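A rough sketch of the shape such an application-wide reaper loop can take; the type, the reapStale hook, and the interval handling are illustrative assumptions, not this PR's actual implementation:

```go
package reaper

import (
	"context"
	"time"
)

// reaper is a stand-in for whichever component owns last-ditch cleanup.
type reaper struct {
	reapStale func(context.Context) // hypothetical cleanup hook
}

// run triggers a cleanup pass on every tick and does one final best-effort
// pass on shutdown, so stale records stay bounded while running and are also
// removed on exit.
func (r *reaper) run(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			r.reapStale(context.Background()) // best-effort final pass on exit
			return
		case <-ticker.C:
			r.reapStale(ctx)
		}
	}
}
```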
Force-pushed from d73a39c to 1222cbb
		continue
	}
} else {
	t.lggr.Debugw("Reaped stale transmissions", "nDeleted", n)
Missing break? If we get here, we're just spinning in the for loop and spamming logs like crazy :)
Also the Reaper gets enabled by default on unrelated DONs.
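For illustration, one shape the fix for the missing break could take, assuming a hypothetical deleteBatch helper that reports how many rows it removed; the names here are not from this PR:

```go
package reaper

import (
	"context"

	"go.uber.org/zap"
)

// drainStale keeps deleting in batches until a batch comes back short, which
// means the table is caught up; returning at that point plays the role of the
// missing break, and without it the loop spins and spams the log.
func drainStale(
	ctx context.Context,
	lggr *zap.SugaredLogger,
	deleteBatch func(context.Context, int64) (int64, error), // hypothetical helper
	batchSize int64,
) error {
	for {
		n, err := deleteBatch(ctx, batchSize)
		if err != nil {
			return err
		}
		if n > 0 {
			lggr.Debugw("Reaped stale transmissions", "nDeleted", n)
		}
		if n < batchSize {
			return nil // caught up; stop instead of spinning
		}
	}
}
```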
We have observed unbounded queue growth in production. This PR implements a swathe of bugfixes and improvements intended to make queue management more reliable and safe.