
Flaky test: TestResolverRemovedWithRPCs #7799

Closed
dfawley opened this issue Nov 1, 2024 · 6 comments · Fixed by #7804

@dfawley
Member

dfawley commented Nov 1, 2024

This test seems extremely flaky now. The run below is from master, but I'm seeing it flake on almost every CI run for a PR today:

https://github.com/grpc/grpc-go/actions/runs/11632169595/job/32394719047#step:8:226

@arjan-bal
Contributor

All the failures have the following warning:

tlogger.go:116: WARNING ads_stream.go:609 [xds] [xds-client 0xc0008a3300] [xds-channel 0xc000139b60] [ads-stream 0xc0008a3480] ADS stream received a response for resource "route-config-name", but no state exists for it (t=+3.781632ms)
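
For reference, the check that produces this warning can be sketched as below. This is a hypothetical, self-contained model, not the actual ads_stream.go code: a response that arrives for a resource name with no registered watch state is logged and discarded rather than delivered to a watcher.

```go
package main

import "fmt"

// adsStream is a hypothetical stand-in for the real ads_stream.go type; it
// models only the per-resource watch state relevant to this warning.
type adsStream struct {
	resourceStates map[string]func(resource string) // resource name -> delivery callback
}

// handleResource drops any response for a resource name that has no
// registered watch state, logging the warning seen in the failing runs.
func (s *adsStream) handleResource(name, resource string) {
	state, ok := s.resourceStates[name]
	if !ok {
		fmt.Printf("WARNING: ADS stream received a response for resource %q, but no state exists for it\n", name)
		return
	}
	state(resource)
}

func main() {
	s := &adsStream{resourceStates: make(map[string]func(string))}
	// No watch is registered for "route-config-name", so this update is dropped.
	s.handleResource("route-config-name", "route-config-v1")
}
```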

@arjan-bal
Contributor

The most recently submitted xDS-related PR is #7773.

I tried merging everything up to the commit before #7773 into an unsubmitted PR (#7758) and didn't see the same flakes there. I suspect #7773 may be the root cause.

FYI @easwars

@arjan-bal
Contributor

I saw the test fail even though my PR didn't include the commit for #7773.

https://github.com/grpc/grpc-go/actions/runs/11662268875/job/32468286292?pr=7742

easwars self-assigned this Nov 4, 2024
@dfawley
Member Author

dfawley commented Nov 4, 2024

I saw the test fail even though my PR didn't include the commit for #7773.

I believe GitHub Actions works by merging your PR onto master and then running the tests on the result, so your run may well have included #7773 even if your branch on your fork did not.

@easwars
Contributor

easwars commented Nov 4, 2024

This is the sequence of events that I see in the failing test:

  • Management server is configured with listener L and route configuration R.
  • xDS resolver requests these resources through the xDS client and receives updates for them.
  • xDS resolver sends a valid service config and RPCs work at this point.
  • The resources are removed on the management server.
  • xDS resolver sees a resource-not-found error for the listener resource and therefore stops watching route configuration R.
    • As part of this watch being canceled, the xDS client stops requesting the route configuration resource. The discovery request without this resource name is sent out asynchronously, but the internal state of the xDS client is updated synchronously to indicate that this resource is not being watched anymore.
  • The test reconfigures the listener and route configuration resources on the management server.
  • The management server immediately sends the route configuration resource to the xDS client, because it has yet to receive the discovery request that drops this resource name.
  • The xDS client receives the route configuration resource, but it finds that it did not request this resource, and therefore drops it on the floor.
  • The xDS client also receives the listener resource and at this point, re-requests the route configuration resource.
  • But the management server does not send the resource again, because it thinks that it has already sent it.

This also seems related to envoyproxy/go-control-plane#431.

But I have yet to figure out why this is happening so much more frequently now than in the past. A minimal sketch of the race is below.
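
The following self-contained Go program is a hypothetical sketch of the race, not the actual grpc-go or go-control-plane code. It models only the three pieces described above: the client's synchronously updated watch state, the unsubscribe request that is still in flight, and the server's "already sent" version bookkeeping that suppresses the resend.

```go
package main

import "fmt"

func main() {
	const route = "route-config-name"

	// Server state: the resources it serves and the versions it believes it
	// has already delivered to the client (go-control-plane-style bookkeeping).
	serverResources := map[string]string{route: "v1"}
	serverSent := map[string]string{}

	// Client watch state, updated synchronously on watch cancel, while the
	// discovery request reflecting the cancel travels asynchronously.
	clientWatching := map[string]bool{route: true}

	// push models the server sending any resource it has not yet delivered.
	// The server still believes the client is subscribed, because the
	// unsubscribe request has not arrived.
	push := func() {
		for name, version := range serverResources {
			if serverSent[name] == version {
				continue // server thinks the client already has this version
			}
			serverSent[name] = version
			if !clientWatching[name] {
				// Client side: no watch state exists for this resource, so
				// the update is dropped (the warning seen in the failing runs).
				fmt.Printf("client drops %s %s\n", name, version)
				continue
			}
			fmt.Printf("client accepts %s %s\n", name, version)
		}
	}

	// 1. The listener resource is removed; the route watch is canceled
	//    synchronously, but the unsubscribe has not reached the server.
	clientWatching[route] = false

	// 2. The test re-adds the resources; the server pushes the route
	//    immediately and records it as sent.
	serverResources[route] = "v2"
	push() // -> client drops route-config-name v2

	// 3. The client sees the new listener and re-requests the route, but the
	//    server's bookkeeping says v2 was already delivered, so it stays
	//    silent and the client never receives the route configuration.
	clientWatching[route] = true
	push() // -> no output
}
```

Running this prints only the dropped v2 push; the second push sends nothing, which matches the hang seen in the test.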
