Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updating from go-grpc v1.64.0 to v1.68.1 broke our AWS deployments, migration path is unclear #7922

Closed
AnomalRoil opened this issue Dec 11, 2024 · 3 comments · Fixed by #7980
Assignees
Labels
Area: Auth Includes regular credentials API and implementation. Also includes advancedtls, authz, rbac etc. Type: Feature New features or improvements in behavior

Comments

@AnomalRoil
Copy link
Contributor

What version of gRPC are you using?

v1.68.1

What version of Go are you using (go version)?

go1.22.10

What operating system (Linux, Windows, …) and version?

Linux 4.14.348-265.565.amzn2.x86_64

What did you do?

We routinely updated our dependencies and released a new patch version without thinking too much about it since it built locally, passed integration tests, built in Github actions and all...

We deployed it in our test AWS environment and it is now somewhat broken:

  • the nodes on AWS using the newest version were unable to talk to each other, failing with:
    transport: authentication handshake failed: credentials: cannot check peer: missing selected ALPN property
    
  • the nodes on AWS using the newest version were still able to talk to nodes running the older version
  • the nodes on AWS were still able to receive gRPC calls from the older nodes

So it was really just the newer nodes being unable to talk to each other, while being fine with the older ones, which led us to discover #7769 which in turn led us to discover that release v1.67.0 introduced the new behaviour:

In accordance with RFC 7540, clients and servers will now reject TLS connections that don't support ALPN. This can be disabled by setting the environment variable GRPC_ENFORCE_ALPN_ENABLED to false (case insensitive). Please file a bug if you encounter any issues with this behavior.

Here is our current use case that broke:

  • The NAT gateway has a public IP and this is effectively the IP that will be the source for gRPC connections from our node.
  • The node is accessible through a different public IP via a AWS Classic Load Balancer. The load balancer acts as firewall, using an attached security group for allowlisting IPS, corresponding to the rest of the servers.
  • The load balancer's IP has the hostname associated to it and performs TLS termination on the incoming gRPC connections.

It is unclear what the path forward is for us in order to support ALPN, and you said to file a bug if we encountered any issues with the new behaviour, so here am I.

But I don't think there's anything to do on your side, except maybe have a migration path / improved documentation around that error message since a little Google search didn't return anything recent about this issue. Hopefully this issue might point developers googling the issue in the right direction.

What did you expect to see?

Nodes continuing to connect to each others just fine.
In case of big breaking changes such as these, some documentation explaining how to typically enable ALPN in a go-grpc deployment, and including common reverse proxies use-cases such as nginx, AWS or GCP.

What did you see instead?

ALPN-related errors and no related documentation or issues about it.

@AnomalRoil
Copy link
Contributor Author

So, as I said above, I wouldn't be shocked this is just closed as a wontfix, but hopefully having the error message in the text should point developers in the right direction if they encounter that in the coming weeks/months after upgrading their dependencies.

On our side, we'll look into enabling ALPN.
In the meantime, using the env variable GRPC_ENFORCE_ALPN_ENABLED to revert the behaviour to the previous one, or just sticking with the previous version should be fine.

@arjan-bal arjan-bal self-assigned this Dec 11, 2024
@arjan-bal
Copy link
Contributor

arjan-bal commented Dec 13, 2024

We will be making the following changes:

  1. Add an experimental Credentials implementation that doesn't enforce ALPN. This can be used for migrations to ALPN enforcement. There will be a note mentioning these credentials will be removed in the future, but users can copy them if they want.
  2. Update the returned error string to be more descriptive, possibly linking to GithHub issue and the experimental credentials.
  3. After a couple of releases, we will remove the experimental credentials and the env variable. Users who still need to disable ALPN can copy the custom credentials.

@purnesh42H purnesh42H added Area: Auth Includes regular credentials API and implementation. Also includes advancedtls, authz, rbac etc. Area: Transport Includes HTTP/2 client/server and HTTP server handler transports and advanced transport features. and removed Area: Auth Includes regular credentials API and implementation. Also includes advancedtls, authz, rbac etc. labels Dec 16, 2024
@arjan-bal arjan-bal added Type: Feature New features or improvements in behavior Area: Auth Includes regular credentials API and implementation. Also includes advancedtls, authz, rbac etc. and removed Type: Bug Area: Transport Includes HTTP/2 client/server and HTTP server handler transports and advanced transport features. labels Dec 16, 2024
@stanhu
Copy link

stanhu commented Dec 24, 2024

Thanks! I'd like to ask to keep a workaround available for a while.

I've been working with a customer that run into downtime due to this change. The customer was using the NGINX Stream module with a config that looked like:

stream {
    upstream backend {
      server <some IP>:2305;
    }

    server {
      listen <some IP>3305;
      proxy_pass backend;
      proxy_ssl on; 
    }
}

Here's the confusing part: the NGINX Stream module is a TCP load balancer that also happens to speak TLS. However, it does not decrypt the application-layer (HTTP/2, gRPC) traffic, so it's still considered a Layer 4 load balancer.

The NGINX stream module has a key setting proxy_ssl. If proxy_ssl is disabled, then NGINX simply proxies the TLS connection to the backend:

sequenceDiagram
    autonumber
    participant Client as Client
    participant NGINX as NGINX Stream Proxy
    participant Backend as Backend

    Client->>NGINX: Establish TLS Session (w/ ALPN Extension)
    NGINX->>Backend: Establish TLS Session (w/ ALPN Extension)
    Client->>NGINX: Encrypted TCP packets 
    NGINX->>Backend: Encrypted TCP packets 
    Backend->>NGINX: Encrypted Response
    NGINX->>Client: Encrypted Response
Loading

However, when the proxy_ssl configuration setting is enabled, NGINX establishes a new TLS connection to the backend, and re-encrypts TCP packets:

sequenceDiagram
    autonumber
    participant Client as Client
    participant NGINX as NGINX Stream Proxy
    participant Backend as Praefect

    Client->>NGINX: Establish TLS Session (w/ ALPN Extension)
    NGINX->>Backend: Establish New TLS Session
    Client->>NGINX: Encrypted TCP packets    
    NGINX-->>NGINX: Decrypt TCP packets
    NGINX->>Backend: Forwarded, re-encrypted TCP packets
    Backend->>NGINX: Encrypted Response
    NGINX->>NGINX: Decrypt Response
    NGINX-->>Client: Re-encrypted TCP packets
Loading

Note on step 2, ALPN is currently not used, and this is why grpc-go v1.67.0 fails to accept the connection.

In 2017, someone proposed a patch to add proxy_ssl_alpn to NGINX, but a NGINX maintainer in https://mailman.nginx.org/pipermail/nginx-devel/2017-July/010307.html responded:

Using ALPN doesn't seem to be needed when working with normal HTTP. On the other hand, we probably should use ALPN automatically when connecting to a HTTP/2 backend over SSL, as per RFC7540 (https://tools.ietf.org/html/rfc7540#section-3.4, "implementations that support HTTP/2 over TLS MUST use protocol negotiation in TLS"). Requiring a user to use an additional option looks strange, not to mention it is non-compliant.

Note that if the native gRPC support is used, ALPN is fully supported. However, due a sidechannel to bypass gRPC for performance reasons, our software can't yet a gRPC load balancer, so we're working on fixing that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: Auth Includes regular credentials API and implementation. Also includes advancedtls, authz, rbac etc. Type: Feature New features or improvements in behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants