Make osq runner responsive to registration updates #2007

Open

zackattack01 wants to merge 23 commits into main from zack/runner_registration_ids

Conversation

@zackattack01 (Contributor) commented Dec 17, 2024

  • reworks interfaces for runner change detection
    • knapsack's querier is currently implemented by our runner, but we need the runner to do more here. This adds an OsqRunner interface which encompasses the previous InstanceQuerier interface and adds our new RegistrationChangeHandler requirements (see the sketch after this list)
    • adds an UpdateRegistrationIDs method to the runner which will detect any changes and restart the instances accordingly
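
Roughly, the interface split could look something like this (sketch only; Query's signature is taken from the diff below, while the UpdateRegistrationIDs signature and the package layout are assumptions, not the final code):

    package types

    // InstanceQuerier is the existing querier interface that the runner already implements
    // (signature mirrored from Runner.Query in this PR).
    type InstanceQuerier interface {
        Query(query string) ([]map[string]string, error)
    }

    // RegistrationChangeHandler is the new requirement: react to registration ID updates.
    type RegistrationChangeHandler interface {
        UpdateRegistrationIDs(registrationIDs []string) error
    }

    // OsqRunner combines both, so knapsack can hold a single runner reference.
    type OsqRunner interface {
        InstanceQuerier
        RegistrationChangeHandler
    }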

To allow for a graceful restart of all known instances after the registration IDs have been updated, a few tricky changes were required that I'd like more eyes on. Some context:

  • within our Shutdown method, we currently close the shutdown channel to signal our shutdown intent. This works well but isn't compatible with needing to read from that channel again when the runner should stay up (e.g. restarts triggered by updated registration IDs). Resetting the channel did not feel correct and introduces data races due to the way we need to thread/use wait groups here
  • I initially tried reworking things to have alternate shutdown patterns for reloading/restarting vs actually shutting down, and adding a secondary reload channel, but we would still need the shutdown channel to stay open
  • updating to just send on the shutdown channel and leave it open breaks all shutdown functionality, because there is no way to sync the timing required here across multiple instances. A send blocks until it is read on the unbuffered channel, but depending on the state of any given instance at the time shutdown is called, we may not have that select block open (we block on instance.Exited() above it)
  • so we need r.shutdown to hold that message. The reason this worked as expected with close(r.shutdown) is that a closed channel reads a zero value and fires immediately whenever it is next read, even if the select wasn't open before close was called
  • the only way I could think of to get around this was to instead make that a buffered channel, and ensure we send one shutdown message per instance

If anyone has alternate suggestions please let me know!
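
To illustrate the key property, here is a self-contained sketch of the buffered-channel approach (the instance names and the overall structure are invented for illustration; only the "one buffered message per instance" idea comes from this PR):

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        instances := []string{"default", "secondary"}

        // buffered with one slot per instance, so every send below completes
        // immediately and the message waits until the instance next reads
        shutdown := make(chan struct{}, len(instances))

        var wg sync.WaitGroup
        for _, name := range instances {
            wg.Add(1)
            go func(name string) {
                defer wg.Done()
                <-shutdown // fires even if the send happened before we got here
                fmt.Println(name, "shutting down")
            }(name)
        }

        // one shutdown message per instance; none of these sends block
        for range instances {
            shutdown <- struct{}{}
        }
        wg.Wait()
    }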

@zackattack01 force-pushed the zack/runner_registration_ids branch from a1d2603 to 01e4063 on December 19, 2024 17:35
@zackattack01 force-pushed the zack/runner_registration_ids branch from 01e4063 to 2d74749 on December 19, 2024 18:27
@@ -49,6 +53,31 @@ func New(k types.Knapsack, serviceClient service.KolideService, opts ...OsqueryI
}

func (r *Runner) Run() error {
    for {
        // if our instances ever exit unexpectedly, return immediately
Contributor

I think this comment isn't accurate? I think runRegisteredInstances only returns if a shutdown was requested. And it only returns an error if a) shutdown was requested and b) we were trying to restart one or more instances during that time and c) hadn't successfully restarted one of them yet.

Either way, why would we return the error here instead of checking rerunRequired first?
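
For illustration, checking rerunRequired before returning might look something like this (a compilable stand-in only; runRegisteredInstances and rerunRequired come from this PR, everything else is invented to make the sketch run):

    package main

    import "sync/atomic"

    type runner struct {
        rerunRequired atomic.Bool
    }

    // stand-in: in the real runner this blocks until a shutdown/restart is requested
    func (r *runner) runRegisteredInstances() error { return nil }

    func (r *runner) Run() error {
        for {
            err := r.runRegisteredInstances()

            // check rerunRequired before bailing on the error: a shutdown that was
            // requested only to reload registration IDs should loop back around
            if r.rerunRequired.Load() {
                r.rerunRequired.Store(false)
                continue
            }

            return err
        }
    }

    func main() {
        r := &runner{}
        _ = r.Run()
    }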

Contributor Author

That's a good point, thank you. We'd probably want to rerun regardless if required; I will update that comment and get this fixed up!

@zackattack01 zackattack01 marked this pull request as ready for review January 24, 2025 18:47
@RebeccaMahany (Contributor) left a comment

This makes sense to me! Really nice.

pkg/osquery/runtime/runner.go (outdated; conversation resolved)
pkg/osquery/runtime/runner.go (conversation resolved)
        return nil
    }

    r.slogger.Log(context.TODO(), slog.LevelDebug,
Contributor

Nit -- I think it's reasonable to log this at the info level (so that we ship this log to the cloud)

Contributor Author

oh good call, will do!

@@ -205,6 +253,13 @@ func (r *Runner) Query(query string) ([]map[string]string, error) {
}

func (r *Runner) Interrupt(_ error) {
if r.interrupted.Load() {
Contributor

Do you think it'd be useful to also do r.rerunRequired.Store(false) here? I'm thinking about the case where UpdateRegistrationIDs is called and then Interrupt is called while the shutdown/restart is ongoing. Could maybe add a test case for this as well?
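
For illustration, that might land roughly like this (a compilable stand-in; the interrupted guard mirrors the diff above, while the Shutdown stub and the rest are invented):

    package main

    import "sync/atomic"

    type runner struct {
        interrupted   atomic.Bool
        rerunRequired atomic.Bool
    }

    // stand-in for the real Shutdown
    func (r *runner) Shutdown() error { return nil }

    func (r *runner) Interrupt(_ error) {
        if r.interrupted.Load() {
            // already interrupted, nothing more to do
            return
        }
        r.interrupted.Store(true)

        // an interrupt should override a pending reload: without this, a restart
        // requested by UpdateRegistrationIDs while the shutdown is in flight could
        // bring the instances back up after we've been asked to exit
        r.rerunRequired.Store(false)

        _ = r.Shutdown()
    }

    func main() {
        (&runner{}).Interrupt(nil)
    }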

Contributor Author

ohh yeah that seems wise. will do both!

    instances      map[string]*OsqueryInstance // maps registration ID to currently-running instance
    instanceLock   sync.Mutex                  // locks access to `instances` to avoid e.g. restarting an instance that isn't running yet
    slogger        *slog.Logger
    knapsack       types.Knapsack
    serviceClient  service.KolideService   // shared service client for communication between osquery instance and Kolide SaaS
    settingsWriter settingsStoreWriter     // writes to startup settings store
    opts           []OsqueryInstanceOption // global options applying to all osquery instances
    shutdown       chan struct{}           // buffered shutdown channel to enable shutting down in order to restart or exit
    rerunRequired  atomic.Bool
Contributor

Maybe add a quick comment here? // if set, the runner will restart all instances after Shutdown is called, instead of exiting or similar

Co-authored-by: Rebecca Mahany-Horton <[email protected]>
@@ -27,14 +28,16 @@ type settingsStoreWriter interface {

type Runner struct {
    registrationIds []string   // we expect to run one instance per registration ID
    regIDLock       sync.Mutex // locks access to registrationIds
Contributor

I discovered that if we are locking for a single value, there is an atomic.Value or atomic.Pointer that can be used, like:

	var registrationIds atomic.Value
	registrationIds.Store([]string{"a", "b", "c"})
	theIds := registrationIds.Load().([]string)

or

	var registrationIds atomic.Pointer[[]string]
	registrationIds.Store(&[]string{"a", "b", "c"})
	theIds := registrationIds.Load()

the internets generally say that atomics are faster ... but I doubt that it makes any noticeable difference for us; also, a pointer to a slice feels weird and I'm not sure of the implications

feel free to not act on this, I don't know if it's any better

Comment on lines +278 to 280
    for range r.instances {
        r.shutdown <- struct{}{}
    }
Contributor

what about giving each instance its own shutdown channel to avoid the buffered channel?

Suggested change

    for range r.instances {
        r.shutdown <- struct{}{}
    }

    for _, instance := range r.instances {
        close(instance.shutdown)
    }

Contributor Author

Oh yeah, I'm gonna give this a try, ty! I'm actually seeing some interesting behavior while trying to write a test for Interrupt during restart per Becca's comment here, and I wonder if this wouldn't clear things up a little. The behavior I'm seeing in the test is that the interrupt can end up ignored if timed correctly, and I think it's related to the buffered channel use so far.
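
For reference, the per-instance close pattern in isolation behaves like this (sketch only; the instance struct and names here are invented and are not the PR's OsqueryInstance):

    package main

    import (
        "fmt"
        "sync"
    )

    type instance struct {
        name     string
        shutdown chan struct{}
    }

    func main() {
        instances := map[string]*instance{
            "default":   {name: "default", shutdown: make(chan struct{})},
            "secondary": {name: "secondary", shutdown: make(chan struct{})},
        }

        var wg sync.WaitGroup
        for _, i := range instances {
            wg.Add(1)
            go func(i *instance) {
                defer wg.Done()
                // a closed channel always yields immediately, whenever this read happens
                <-i.shutdown
                fmt.Println(i.name, "shutting down")
            }(i)
        }

        // close each instance's own channel: no buffer sizing up front, and no risk
        // of one instance consuming a message that was meant for another
        for _, i := range instances {
            close(i.shutdown)
        }
        wg.Wait()
    }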
