Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition in xgress_edge_tunnel tunneller at start but not seen in pre-compiled binary #1414

Closed
arunsworld opened this issue Oct 6, 2023 · 4 comments

Comments

@arunsworld
Copy link

Hi,

I'm seeing the following issue in the edge router when I compile ziti from source that I don't see with the released version. I can see why I have the issue (a race condition) that I'm able to fix but I cannot explain why the released version works. Any help is appreciated.

The issue I'm facing is with the tunneller during controlPlane startup. At startup (Router.startControlPlane), the control endpoints are setup (self.ctrls.UpdateControllerEndpoints(endpoints)) followed by Run calls to xrctrls, one of which is the tunnel Factory. The NetworkController setup however completes in a goroutine and takes about 0.5 seconds; meanwhile however the tunnel setup finds no controllers and effectively skips the tunnel setup.

A quick test is by disabling tunnelling for the router in the controller but leaving the configuration in for the router.

Expected behaviour: failure to start with: FATAL ziti/router/xgress_edge_tunnel.(*servicePoller).pollServices: {error=[tunneling not enabled]} xgress_edge_tunnel unable to authenticate to controller. ensure tunneler mode is enabled for this router or disable tunnel listener. exiting

Observed behaviour: No error. Router starts but no tunnelling functionality available.

In debugging I can see that ctrlMap := self.factory.ctrls.GetAll() - line 172 in fabric.go - returns an empty map so there is no session setup for the controller and it's never visited again.

When I remove the goroutine for the UpdateControllerEndpoints (within connectToControllerWithBackoff in env package, ctrls.go), the code waits for the controller to be setup before proceeding ensuring tunnelling is setup OK.

This is strangely the behaviour I'm seeing when I use the pre-compiled binary (v0.30.4) but I cannot explain why the pre-compiled binary is working based on the code.

Any help is therefore appreciated and happy to add more information as needed.

Thanks,
Arun

@dovholuknf
Copy link
Member

Hi @arunsworld. Can you describe how you're starting the controller and the router? How'd you generate the config files? Are you willing to share the configuration if needed? Is there anything "important or special" that might be relevant? Does the locally built router run using the exact same configuration?

@arunsworld
Copy link
Author

arunsworld commented Oct 6, 2023

Hi @dovholuknf. Happy to share the configs. I got started using the quick start expressInstall and for the most part the config is untouched except minor things around ports (eg. there aren't 2 properties for ADVERTISED vs BIND port and in my case I wanted the ADVERTISED port to be 443 while the BIND port was an arbitrary port). I think the only relevant bit is the controller is a different data centre so there is a bit of latency.

However, with everything being 100% the same, I observe the behaviour difference between one binary version and the other and it's perfectly reproducible.

To reproduce the issue I turn off Tunneller Enabled but leave my router config as below:

listeners:
# bindings of edge and tunnel requires an "edge" section below
  - binding: edge
    address: tls:0.0.0.0:8442
    options:
      advertise: localhost:8442
      connectTimeoutMs: 5000
      getSessionTimeout: 60
  - binding: tunnel
    options:
      mode: host #tproxy|host

Here are the logs from the released version (v0.30.4):

$ ziti router run router.yaml

[   0.038]    INFO ziti/ziti/router.run: {revision=[b9bf4d955c39] version=[v0.30.4] go-version=[go1.20.8] routerId=[5Wd7iovsP] build-date=[2023-09-30T19:09:27Z] os=[darwin] configFile=[router.yaml] arch=[amd64]} starting ziti-router
[   0.040]    INFO ziti/common/metrics.GoroutinesPoolMetricsConfigF.func1.1: {maxQueueSize=[1000] minWorkers=[0] maxWorkers=[32] poolType=[pool.link.dialer] idleTime=[30s]} starting goroutine pool
[   0.040]    INFO ziti/router/forwarder.(*Scanner).run: started
[   0.041]    INFO ziti/router/forwarder.(*Faulter).run: started
...
[   0.054]    INFO ziti/router.(*Router).startXgressListeners: created xgress listener [tunnel] at []
[   0.055]    INFO ziti/router.(*Router).getInitialCtrlEndpoints: controller endpoints file [endpoints] doesn't exist. Using initial endpoints from config
[   0.055]    INFO ziti/router.(*Router).startControlPlane: router configured with 1 controller endpoints
[   0.055]    INFO ziti/router/env.(*networkControllers).UpdateControllerEndpoints: {endpoint=[map[tls:ziti-router-ctrl.mydomain.com:443:{}]]} adding new ctrl endpoint
[   0.056]    INFO ziti/router/env.(*networkControllers).connectToControllerWithBackoff: {endpoint=[tls:ziti-router-ctrl.mydomain.com:443]} starting connection attempts
[   0.056]    INFO ziti/router/fabric.(*StateManagerImpl).StartHeartbeat: heartbeat starting
[   0.057]    INFO ziti/router/xgress_edge_tunnel.(*tunneler).Start: {mode=[host]} creating interceptor
[   0.058] WARNING ziti/tunnel/dns.flushDnsCaches: {error=[exec: "resolvectl": executable file not found in $PATH]} unable to find systemd-resolve or resolvectl in path, consider adding a dns flush to your restart process
[   0.059]    INFO ziti/router/xgress_edge.(*CertExpirationChecker).Run: waiting 8607h54m46.01093s to renew certificates
[   0.966]    INFO ziti/router/env.(*networkControllers).connectToControllerWithBackoff.func3: {endpoint=[tls:ziti-router-ctrl.mydomain.com:443]} successfully connected to controller
[   0.969]    INFO ziti/router/handler_edge_ctrl.(*helloHandler).HandleReceive.func1: received server hello, replying
[   0.970]    INFO ziti/router/fabric.(*StateManagerImpl).AddSignerPublicCert: {received=[1] added=[1] ignored=[0]} received signer public certificates
[   0.972]    INFO ziti/router/link.(*linkDestUpdate).ApplyListenerChanges: {routerId=[lT--ORmNUG] address=[tls:ziti-router-transport.mydomain.com:443] linkKey=[default->tls:lT--ORmNUG->default]} new potential link
[   0.972]    INFO ziti/router/link.(*linkDestUpdate).ApplyListenerChanges: {linkKey=[default->tls:lT--ORmNUG->default] routerId=[lT--ORmNUG] address=[tls:ziti-router-transport.mydomain.com:443]} link already known
[   0.984]    INFO ziti/tunnel/intercept.SetDnsInterceptIpRange: dns intercept IP range: 100.64.0.1 - 100.127.255.254
[   1.131]    INFO ziti/router/handler_edge_ctrl.(*apiSessionAddedHandler).instantSync: {strategy=[instant]} first api session syncId [clne6jtsx0w2kfhp7ydssdpal], starting
[   1.131]    INFO ziti/router/handler_edge_ctrl.(*apiSessionSyncTracker).Add: received api session sync chunk 0, isLast=true
[   1.133]   ERROR ziti/router/xgress_edge_tunnel.(*fabricProvider).authenticate.func1: {error=[tunneling not enabled] ctrlId=[ziti-router-ctrl.mydomain.com]} failed to authenticate
[   1.133]   FATAL ziti/router/xgress_edge_tunnel.(*servicePoller).pollServices: {error=[tunneling not enabled]} xgress_edge_tunnel unable to authenticate to controller. ensure tunneler mode is enabled for this router or disable tunnel listener. exiting

The line to look at is the tunnelling code is only executed at 0.984 seconds and crucially after the "successfully connected to controller" line.

Here are the logs from the compiled version (HEAD of release-next, but also tried v0.30.4 tag):

[   0.014]    INFO ziti/ziti/router.run: {build-date=[2020-01-01 01:01:01] version=[v0.0.0] go-version=[go1.21.1] routerId=[5Wd7iovsP] os=[darwin] revision=[local] configFile=[router.yaml] arch=[arm64]} starting ziti-router
[   0.015]    INFO ziti/common/metrics.ConfigureGoroutinesPoolMetrics.GoroutinesPoolMetricsConfigF.func1.1: {minWorkers=[0] maxWorkers=[32] maxQueueSize=[1000] idleTime=[30s] poolType=[pool.link.dialer]} starting goroutine pool
[   0.015]    INFO ziti/common/metrics.ConfigureGoroutinesPoolMetrics.GoroutinesPoolMetricsConfigF.func1.1: {maxQueueSize=[1000] poolType=[pool.route.handler] minWorkers=[0] maxWorkers=[128] idleTime=[30s]} starting goroutine pool
...
[   0.015]    INFO ziti/router.(*Router).startXgressListeners: created xgress listener [tunnel] at []
[   0.015]    INFO ziti/router.(*Router).getInitialCtrlEndpoints: controller endpoints file [endpoints] doesn't exist. Using initial endpoints from config
[   0.015]    INFO ziti/router.(*Router).startControlPlane: router configured with 1 controller endpoints
[   0.015]    INFO ziti/router/xgress_edge.(*Acceptor).Run: starting
[   0.016]    INFO ziti/router/env.(*networkControllers).UpdateControllerEndpoints: {endpoint=[map[tls:ziti-router-ctrl.mydomain.com:443:{}]]} adding new ctrl endpoint
[   0.016]    INFO ziti/router/env.(*networkControllers).connectToControllerWithBackoff: {endpoint=[tls:ziti-router-ctrl.mydomain.com:443]} starting connection attempts
[   0.016]    INFO ziti/router/fabric.(*StateManagerImpl).StartHeartbeat: heartbeat starting
[   0.016]    INFO ziti/router/xgress_edge_tunnel.(*tunneler).Start: {mode=[host]} creating interceptor
[   0.016]    INFO ziti/router/xgress_edge.(*CertExpirationChecker).Run: waiting 8607h47m51.887284s to renew certificates
[   0.016] WARNING ziti/tunnel/dns.flushDnsCaches: {error=[exec: "resolvectl": executable file not found in $PATH]} unable to find systemd-resolve or resolvectl in path, consider adding a dns flush to your restart process
[   0.206]    INFO ziti/tunnel/intercept.SetDnsInterceptIpRange: dns intercept IP range: 100.64.0.1 - 100.127.255.254
[   0.529]    INFO ziti/router/env.(*networkControllers).connectToControllerWithBackoff.func3: {endpoint=[tls:ziti-router-ctrl.mydomain.com:443]} successfully connected to controller
[   0.534]    INFO ziti/router/link.(*linkDestUpdate).ApplyListenerChanges: {address=[tls:ziti-router-transport.mydomain.com:443] routerId=[lT--ORmNUG] linkKey=[default->tls:lT--ORmNUG->default]} new potential link
[   0.534]    INFO ziti/router/link.(*linkDestUpdate).ApplyListenerChanges: {routerId=[lT--ORmNUG] address=[tls:ziti-router-transport.mydomain.com:443] linkKey=[default->tls:lT--ORmNUG->default]} link already known
[   0.534]    INFO ziti/router/handler_edge_ctrl.(*helloHandler).HandleReceive.func1: received server hello, replying
[   0.534]    INFO ziti/router/fabric.(*StateManagerImpl).AddSignerPublicCert: {added=[1] received=[1] ignored=[0]} received signer public certificates
[   0.686]    INFO ziti/router/handler_edge_ctrl.(*apiSessionAddedHandler).instantSync: {strategy=[instant]} first api session syncId [clne6sp1c0we3fhp7tl187ci4], starting
[   0.686]    INFO ziti/router/handler_edge_ctrl.(*apiSessionSyncTracker).Add: received api session sync chunk 0, isLast=true
[   1.056]    INFO ziti/router/handler_link.(*bindHandler).BindChannel: {linkId=[5yOtivUG8ZKczpU2f86hRe] routerId=[lT--ORmNUG] routerVersion=[v0.0.0]} link destination support heartbeats
[   1.561]    INFO ziti/router/handler_link.(*bindHandler).BindChannel: {linkId=[5yOtivUG8ZKczpU2f86hRe] routerId=[lT--ORmNUG] routerVersion=[v0.0.0]} link destination support heartbeats
[   1.563]    INFO ziti/router.(*xlinkAccepter).Accept: accepted new link [l/5yOtivUG8ZKczpU2f86hRe]
[   1.687]    INFO ziti/router/handler_edge_ctrl.(*apiSessionAddedHandler).applySync: finished sychronizing api sessions [count: 4, syncId: clne6sp1c0we3fhp7tl187ci4, duration: 19.334µs]

The difference noted here is the tunnel config is already taking place around 0.206s in, but the controller connection only completes later. Further debugging basically reveals that the authentication step receives 0 controllers and skips over and that code is never visited again.

When I change the code in router/env/ctrls.go as follows to remove doing the connection in a goroutine the observed behaviour is exactly as the released version:

func (self *networkControllers) connectToControllerWithBackoff(endpoint string) {
	log := pfxlog.Logger().WithField("endpoint", endpoint)
    
       .....

	log.Info("starting connection attempts")

	if err := backoff.Retry(operation, expBackoff); err != nil {
		log.WithError(err).Error("unable to connect controller, stopping retries.")
	} else {
		log.Info("successfully connected to controller")
	}

	// go func() {
	// 	if err := backoff.Retry(operation, expBackoff); err != nil {
	// 		log.WithError(err).Error("unable to connect controller, stopping retries.")
	// 	} else {
	// 		log.Info("successfully connected to controller")
	// 	}
	// }()
}

In short this has quite stumped me :) I couldn't find anything in the build process for the released version that could account for the difference in behaviour running the code that it's running. Hopefully you can see something I missed.

@arunsworld
Copy link
Author

Update: I noticed the fix on branch fix-xctrl-startup-race. I can confirm that does fix the issue. Thanks.

I take it the goroutine exists so that multiple controllers can be attempted concurrently; and so we need a wait - at least for one.

Only mystery (to me at least) is why it works in the released version.

@plorenz
Copy link
Member

plorenz commented Oct 6, 2023

Update: I noticed the fix on branch fix-xctrl-startup-race. I can confirm that does fix the issue. Thanks.

I take it the goroutine exists so that multiple controllers can be attempted concurrently; and so we need a wait - at least for one.

Only mystery (to me at least) is why it works in the released version.

Correct. I also tried to reproduce and wasn't able to with my locally built version. One is guess, it's maybe b/c I'm running the controller and router locally and the controller connection is fast enough. Another guess is that maybe you're compiling with Go 1.21? We're still using Go 1.20, as we usually wait for a point release or two before updating (which means it's about time to update). I might try with Go 1.21 and see if I can get it to fail as you described.

Thank you for testing that the fix worked for you!

plorenz added a commit that referenced this issue Oct 11, 2023
Ensure control chan is available before starting xctrls. Fixes #1414
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants