Race condition in xgress_edge_tunnel tunneller at start but not seen in pre-compiled binary #1414
Comments
Hi @arunsworld. Can you describe how you're starting the controller and the router? How'd you generate the config files? Are you willing to share the configuration if needed? Is there anything "important or special" that might be relevant? Does the locally built router run using the exact same configuration?
Hi @dovholuknf. Happy to share the configs. I got started with the quick start expressInstall, and the config is mostly untouched apart from minor changes around ports (e.g. there aren't two separate properties for the ADVERTISED and BIND ports, and in my case I wanted the ADVERTISED port to be 443 while the BIND port was an arbitrary port). I think the only relevant detail is that the controller is in a different data centre, so there is a bit of latency. However, with everything else being 100% the same, I observe a difference in behaviour between one binary and the other, and it's perfectly reproducible. To reproduce the issue I turn off Tunneller Enabled but leave my router config as below:
Here are the logs from the released version (v0.30.4):
The thing to note is that the tunnelling code only executes at 0.984 seconds, crucially after the "successfully connected to controller" line. Here are the logs from the compiled version (HEAD of release-next, but also tried the v0.30.4 tag):
The difference here is that the tunnel configuration already takes place around 0.206s in, but the controller connection only completes later. Further debugging reveals that the authentication step receives 0 controllers and skips over, and that code is never visited again. When I change the code in
In short, this has quite stumped me :) I couldn't find anything in the build process for the released version that could account for the difference in behaviour, given that it runs the same code. Hopefully you can see something I missed.
Update: I noticed the fix on branch I take it the goroutine exists so that multiple controllers can be attempted concurrently, and so we need a wait, at least for one. The only mystery (to me at least) is why it works in the released version.
Correct. I also tried to reproduce and wasn't able to with my locally built version. One guess is that it's because I'm running the controller and router locally and the controller connection is fast enough. Another guess is that maybe you're compiling with Go 1.21? We're still using Go 1.20, as we usually wait for a point release or two before updating (which means it's about time to update). I might try with Go 1.21 and see if I can get it to fail as you described. Thank you for testing that the fix worked for you!
Ensure control chan is available before starting xctrls. Fixes #1414
Hi,
I'm seeing the following issue in the edge router when I compile ziti from source that I don't see with the released version. I can see why I have the issue (a race condition) that I'm able to fix but I cannot explain why the released version works. Any help is appreciated.
The issue I'm facing is with the tunneller during controlPlane startup. At startup (Router.startControlPlane), the control endpoints are set up (self.ctrls.UpdateControllerEndpoints(endpoints)), followed by Run calls to the xctrls, one of which is the tunnel Factory. The NetworkController setup, however, completes in a goroutine and takes about 0.5 seconds; meanwhile, the tunnel setup finds no controllers and is effectively skipped.
A quick test is to disable tunnelling for the router in the controller but leave the tunnel configuration in place for the router.
Expected behaviour: failure to start with:
FATAL ziti/router/xgress_edge_tunnel.(*servicePoller).pollServices: {error=[tunneling not enabled]} xgress_edge_tunnel unable to authenticate to controller. ensure tunneler mode is enabled for this router or disable tunnel listener. exiting
Observed behaviour: No error. Router starts but no tunnelling functionality available.
In debugging I can see that ctrlMap := self.factory.ctrls.GetAll() (line 172 in fabric.go) returns an empty map, so no session is set up for the controller and it's never visited again.
When I remove the goroutine for the UpdateControllerEndpoints (within connectToControllerWithBackoff in the env package, ctrls.go), the code waits for the controller to be set up before proceeding, ensuring tunnelling is set up OK.
Strangely, this is the behaviour I see when I use the pre-compiled binary (v0.30.4), but I cannot explain why the pre-compiled binary works based on the code.
Any help is therefore appreciated and happy to add more information as needed.
Thanks,
Arun