QP reverse proxy gets stuck when the failure of the underlying tcp connection is not detected #14776
Comments
Reading the golang issues it sounds like we can mitigate this failure by setting either |
@dprotaso hi, I managed to debug the python grpc server too. I tested the following scenario:
b) issue the bad http request
c) send a failed grpc request
transformer log:
QP log:
The bad request produces this entry in the python logs:
E0111 11:29:04.370439388 30 hpack_parser.cc:999] Error parsing 'content-type' metadata: invalid value
Then all requests fail with:
E0111 11:29:07.103099567 30 parsing.cc:913] INCOMING[0x7ff9a80023c0;0x7ff9a8015b80]: Parse failed with INTERNAL: Error parsing 'content-type' metadata [type.googleapis.com/grpc.status.int.stream_id='0']
This looks very similar to grpc/grpc#34721. There are two ways to fix this:
I will try to discuss this on the grpc project side.
An "ugly" workaround, but not a fix, is to try to restart connections on the golang side by deliberately making the server reply with GOAWAY and terminate the current connection. This happens because we send pings at different rates, e.g. the dialer has a 5s keep-alive and above I also set a 10s keep-alive at the transport level (my assumption here). However, we might need to set appropriate ReadIdleTimeout or PingTimeout values, as in Azure/azure-sdk-for-go-extensions#29, for all workloads in order to deal with golang/go#59690. That issue is caused by a dropped tcp connection; the current one is not, but the same settings are useful in both scenarios for different reasons. It needs investigation whether the same setting values work around both issues at once. To summarize: |
As mentioned in the grpcio issue (grpc/grpc#34721 (comment)), the issue is not present anymore, at least with |
/area networking
What version of Knative?
1.10+
Expected Behavior
Downstream we face an issue similar to golang/go#59690, see the related blog post here. It happens with a faulty request that kills the connection, but the reverse proxy in QP is never informed about it. I followed the workaround in golang/go#59690 (comment) and it allows the connection to be re-established; however, once a connection has deliberately been failed, it takes some time (a few seconds, depending on the configuration) until it is re-created.
Specifically I changed https://github.com/knative/serving/blob/main/vendor/knative.dev/pkg/network/h2c.go#L49:
to:
Details about the fields:
Here is the output of the QP without and with the workaround:
flan-t5-small-caikit-predictor-00001-deployment-84bc77fd7d5v4qj-queue-no-workaround.log
flan-t5-small-caikit-predictor-00001-deployment-9bb78c88c-2g6jx-queue-proxy-with-work.log
The critical part is this:
The last message shows that the connection is over. Later on you also see new connections via entries like:
As a side note, we already have a 5s keepalive set on the Dialer (https://github.com/knative/serving/blob/main/vendor/knative.dev/pkg/network/transports.go#L87), but the extra settings make the server send the GOAWAY reply.
In any case, we need to protect the reverse proxy from this kind of issue, and until this is fixed on the Golang side we need to at least provide some mechanism for working around it. For now I propose we expose the transport configuration, e.g. via env vars or an annotation, so that users can fix this (to some extent). The downside is that the right transport configuration depends on the app's traffic rate, and if there is no traffic this will keep re-creating the connection. Also, if many requests are in flight when the issue hits, all of them will fail until the connection recovers. In any case, it does not protect apps from someone exploiting this deliberately.
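As a rough sketch of what exposing the transport configuration through env vars could look like, the snippet below parses two durations from the environment. The env var names (`QP_H2C_READ_IDLE_TIMEOUT`, `QP_H2C_PING_TIMEOUT`) and the `transportTimeouts` type are hypothetical and do not exist in Knative; the field names mirror `ReadIdleTimeout` and `PingTimeout` on `http2.Transport` from golang.org/x/net/http2, which is where the values would be applied. The sketch stays within the standard library, so the final assignment to the transport is only described in a comment.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// Hypothetical env var names, chosen for this sketch only.
const (
	readIdleEnv = "QP_H2C_READ_IDLE_TIMEOUT"
	pingEnv     = "QP_H2C_PING_TIMEOUT"
)

// transportTimeouts holds the two HTTP/2 health-check settings from the
// issue. On http2.Transport, a zero ReadIdleTimeout disables the ping
// health check, so "unset" keeps the current behavior.
type transportTimeouts struct {
	ReadIdleTimeout time.Duration
	PingTimeout     time.Duration
}

// durationFromEnv parses an optional, non-negative duration such as "10s".
// An unset or empty variable yields zero.
func durationFromEnv(getenv func(string) string, key string) (time.Duration, error) {
	raw := getenv(key)
	if raw == "" {
		return 0, nil
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return 0, fmt.Errorf("invalid %s=%q: %w", key, raw, err)
	}
	if d < 0 {
		return 0, fmt.Errorf("invalid %s=%q: must not be negative", key, raw)
	}
	return d, nil
}

// loadTransportTimeouts reads both settings. getenv is injected so the
// function can be exercised without touching the process environment.
func loadTransportTimeouts(getenv func(string) string) (transportTimeouts, error) {
	readIdle, err := durationFromEnv(getenv, readIdleEnv)
	if err != nil {
		return transportTimeouts{}, err
	}
	ping, err := durationFromEnv(getenv, pingEnv)
	if err != nil {
		return transportTimeouts{}, err
	}
	return transportTimeouts{ReadIdleTimeout: readIdle, PingTimeout: ping}, nil
}

func main() {
	cfg, err := loadTransportTimeouts(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// In the proxy, these values would be copied onto the matching
	// http2.Transport fields when the h2c transport is built.
	fmt.Printf("read idle timeout: %s, ping timeout: %s\n", cfg.ReadIdleTimeout, cfg.PingTimeout)
}
```

Defaulting to zero when the variables are unset keeps the health check opt-in, which matters given the trade-off noted above: values that suit a busy service may cause needless reconnects on an idle one.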
Actual Behavior
QP gets stuck and can't serve traffic. Note that restarting either the app or the QP container resolves the issue, since the connection is re-established (see more at https://issues.redhat.com/browse/RHOAIENG-165).
Steps to Reproduce the Problem
The reproducer is quite complex and you need to follow the steps here: https://github.com/skonto/debug-caikit-serving/tree/main. It has been observed only on OCP so far. I also tried to create a small local reproducer (check here), but the connection does not seem to get stuck and I am still working on it.