
io error: broken pipe when accessing redis-backed KV store #2974

Closed
tfenster opened this issue Jan 5, 2025 · 30 comments · Fixed by #2976

@tfenster

tfenster commented Jan 5, 2025

I have an application (tfenster/verified-bluesky) that worked fine on your cloud, but due to the 1,024-entry limit of the KV store, I moved it to Azure (Azure Kubernetes Service, Azure Cache for Redis). Since then it works for some time (I didn't measure exactly, but it seems like a few minutes), and then calls to the KV store fail with "io error: broken pipe". I have narrowed it down to the following piece of code as the easiest repro, but other places accessing the store also fail with the same error:

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strings"

	spinhttp "github.com/fermyon/spin/sdk/go/v2/http"
	"github.com/fermyon/spin/sdk/go/v2/kv"
)

func init() {
	spinhttp.Handle(func(w http.ResponseWriter, r *http.Request) {
		switch r.Method {
		case http.MethodGet:
			fmt.Println("Getting stats")
			store, err := kv.OpenStore("default")
			if err != nil {
				http.Error(w, "Error opening kv store "+err.Error(), http.StatusInternalServerError)
				return
			}
			defer store.Close()

			fmt.Println("Getting keys")
			keys, err := store.GetKeys()
			if err != nil {
				http.Error(w, "Error getting keys from kv store "+err.Error(), http.StatusInternalServerError)
				return
			}
			// ... (rest of the handler elided; see main.go)
		}
	})
}

func main() {}

The call to kv.OpenStore still seems to work, but store.GetKeys() fails and I get this response:

HTTP/1.1 500 Internal Server Error
Date: Sun, 05 Jan 2025 08:23:35 GMT
Content-Type: text/plain; charset=utf-8
Content-Length: 55
Connection: close
x-content-type-options: nosniff
Strict-Transport-Security: max-age=31536000; includeSubDomains

Error getting keys from kv store io error: broken pipe

For full context, see main.go. The logs only show this:

Getting stats
Getting keys

The setup of the Redis backend and the creation of the runtime config look like this:

az redis create -n $redisName -g $rgAndClusterName -l $location --sku Basic --vm-size c0 --redis-configuration '{ "maxmemory-policy": "noeviction" }'
$redisKeys = $(az redis list-keys -n $redisName -g $rgAndClusterName) | ConvertFrom-Json

@"
[key_value_store.default]
type = "redis"
url = "rediss://:$($redisKeys.primaryKey)@$redisName.redis.cache.windows.net:6380"
"@ | Out-File ./runtime-config-redis.toml

I then put that into this secret (content redacted, obviously):

apiVersion: v1
kind: Secret
metadata:
  name: verified-bluesky-runtime-config
type: Opaque
data:
  runtime-config.toml: ...
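
One way to produce such a secret from the file above (a sketch using kubectl; the secret name matches the metadata above):

kubectl create secret generic verified-bluesky-runtime-config \
  --from-file=runtime-config.toml=./runtime-config-redis.toml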

Deploying the Spin app looks like this: spinapp.yaml; the latest build to create and push the OCI artifact is here.

Let me know if I can share anything else.

Not sure if that is relevant, but this is the output of the commands you asked for in the template, taken from my dev environment:

  • Spin version (spin --version):
vscode ➜ /workspaces/verified-bluesky/deployment (main) $ spin --version
spin 3.1.1 (aa919ce 2024-12-20)
  • Installed plugin versions (spin plugins list --installed)
vscode ➜ /workspaces/verified-bluesky/deployment (main) $ spin plugins list --installed
cloud 0.10.0 [installed]
kube 0.3.1 [installed]
@tfenster
Author

tfenster commented Jan 5, 2025

As additional context, the Redis side shows no issues, at least not in the Azure dashboard:

[screenshot: Azure dashboard metrics for the Redis instance, showing no errors]

@tfenster
Author

tfenster commented Jan 5, 2025

I have now tried to

  • add the redis URL to allowed_outbound_hosts (rough sketch below)
  • go from 2 to 1 replica
  • remove no-debug from the build command

Still the same error and no additional output in the logs. Removing the pod and letting it get recreated solves it until it happens again.
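
For reference, the allowed_outbound_hosts entry in spin.toml looked roughly like this (a sketch; the component name is a placeholder):

[component.verified-bluesky]
allowed_outbound_hosts = ["rediss://*.redis.cache.windows.net:6380"]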

@tfenster
Author

tfenster commented Jan 5, 2025

Adding a panic(err) call also didn't bring any relevant information, I think:

2025-01-05T16:41:59.156138352Z panic: io error: broken pipe
2025-01-05T16:41:59.156286164Z ERROR spin_trigger_http::server: Error processing request: error while executing at wasm backtrace:
2025-01-05T16:41:59.156295489Z     0: 0x3891 - main!runtime._panic.llvm.2513346125720970973
2025-01-05T16:41:59.156298183Z     1: 0x47483 - main!spin_http_handle_http_request
2025-01-05T16:41:59.156301403Z     2: 0x48223 - main!__wasm_export_spin_http_handle_http_request
2025-01-05T16:41:59.156307315Z     3: 0x13cea3 - wit-component:adapter:wasi_snapshot_preview1!fermyon:spin/inbound-http#handle-request
2025-01-05T16:41:59.156311020Z note: using the `WASMTIME_BACKTRACE_DETAILS=1` environment variable may show more debugging information
2025-01-05T16:41:59.156322874Z
2025-01-05T16:41:59.156327104Z Caused by:
2025-01-05T16:41:59.156329370Z     wasm trap: wasm `unreachable` instruction executed

@itowlson
Contributor

itowlson commented Jan 5, 2025

cc @ThorstenHans for Azure wisdom and @fibonacci1729 for Redis KV knowledge

@ThorstenHans
Contributor

@tfenster Thanks for the detailed error description and all the information for reproducing this one.

I did the following:

  • created a new AKS cluster and a new Azure Cache for Redis using the settings provided
  • deployed latest SpinKube
  • took your Go code to build and deploy a Spin App

I see all requests resulting in an HTTP 200 (I simply return the list of keys as the response). I also used hey to put some load on the app; when sending 100k requests, all calls succeed.
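
The load test was roughly this (the URL is a placeholder for the app endpoint):

# 100k requests with 50 concurrent workers
hey -n 100000 -c 50 https://<app-endpoint>/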

With runtimes older than Spin 2.8.0, we had a similar issue in the context of SQLite (see fermyon/enterprise-architectures-and-patterns#18).

IIRC, containerd-shim-spin was updated to the 3.0.0 runtime by the end of November. The node-installer image is responsible for deploying the runtime to your Kubernetes nodes...

That said, could you check which node-installer image is used for deploying containerd-shim-spin to your Kubernetes nodes?

You can find the version by executing this command:

KWASM_NS=kwasm
KWASM_RELEASE_NAME=kwasm-operator

helm get values -n $KWASM_NS $KWASM_RELEASE_NAME

My setup works with ghcr.io/spinkube/containerd-shim-spin/node-installer:v0.17.0

@tfenster
Author

tfenster commented Jan 6, 2025

@ThorstenHans can you interact with it, then let it run for a bit and then try again? It works for me initially as well, but after some time (maybe 15 minutes, not sure), it breaks. I would also be happy to create an instance for you and give you full access, if that helps

The node-installer image is the same as for you; I created the cluster only two days ago:

helm get values -n kwasm kwasm-operator                    
USER-SUPPLIED VALUES:                                                          
kwasmOperator:
  installerImage: ghcr.io/spinkube/containerd-shim-spin/node-installer:v0.17.0

One other thought: which Go and TinyGo versions are you using? Might that be a cause for issues?

@ThorstenHans
Contributor

That's the trick. After waiting some time, I see the same error. I'll try to spot how/where this happens.

@tfenster
Author

tfenster commented Jan 6, 2025

Great, thanks

@ThorstenHans
Contributor

@tfenster you could use CosmosDB as the KV store. I tested it with the constraints mentioned above, and it works (even if the app is idle for quite some time).

With a CosmosDB instance deployed, you create a database and a container using your desired configuration; the only requirement is that the partition key is set to /id.

The runtime configuration file should look similar to this:

[key_value_store.default]
type = "azure_cosmos"
key = "your_key_here=="
account = "your_account_name"
database = "your_db_name"
container = "your_container_name"
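
Provisioning that on the Azure side could look roughly like this (a sketch with the az CLI; all names are placeholders):

az cosmosdb create -n $accountName -g $rgName
az cosmosdb sql database create -a $accountName -g $rgName -n $dbName
az cosmosdb sql container create -a $accountName -g $rgName -d $dbName \
  -n $containerName --partition-key-path "/id"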

@fibonacci1729 this is a good indicator that there is something broken in the factor for redis-kv

@tfenster
Author

tfenster commented Jan 6, 2025

@ThorstenHans that's what I first tried, but I ran into the RU limits very quickly for the cheap instances, so I would prefer Redis.

@fibonacci1729
Contributor

At a glance, this seems like a quirk of using redis::aio::MultiplexedConnection in the kv redis factor, possibly in conjunction with the Redis instance's default timeout configuration. I'll poke a little deeper to understand what's going on here.

@tfenster
Author

tfenster commented Jan 6, 2025

Not sure if and where I can configure it, but what would be a timeout that could fix this?

@fibonacci1729
Contributor

fibonacci1729 commented Jan 6, 2025

Actually, after a slightly longer glance, I think the issue is that redis::aio::MultiplexedConnection doesn't reconnect after idling/disconnect (AFAICT from the docs). The suggested way to achieve reconnection is to use the ConnectionManager configured with retry/backoff enabled.

Re: a reasonable timeout on the server, I'm not entirely sure. I think any reasonable timeout would just delay the behavior you're currently seeing.
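
For illustration, a minimal sketch of that approach against the redis crate (not the actual Spin patch; assumes the crate's connection-manager feature and a tokio runtime):

use redis::aio::ConnectionManager;

// Unlike a bare MultiplexedConnection, ConnectionManager transparently
// reconnects (with retry/backoff) when the server drops an idle connection.
async fn open(url: &str) -> redis::RedisResult<ConnectionManager> {
    let client = redis::Client::open(url)?;
    ConnectionManager::new(client).await
}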

@tfenster
Author

tfenster commented Jan 6, 2025

OK, then I'll patiently wait for a fix 😊 Thanks a lot for the quick check!

@tfenster
Author

tfenster commented Jan 9, 2025

@fibonacci1729 Would you already have an idea if and when a fix for this issue could become available?

@fibonacci1729
Contributor

Hey @tfenster! Just context switching to this now. I'll have a better idea by EOD once I can reproduce and poke around a bit to confirm the fix. I'll update here.

fibonacci1729 added a commit to fibonacci1729/spin that referenced this issue Jan 9, 2025
fibonacci1729 added a commit that referenced this issue Jan 10, 2025
fix #2974 -- use redis::aio::ConnectionManager
@tfenster
Author

Thanks a lot @fibonacci1729 for working on it so quickly! Do I now need to wait for a release, or can I already get the fix somehow?

@ThorstenHans
Contributor

First, a new Spin release must be cut; then containerd-shim-spin must update the corresponding dependency and cut a release of its own. I'll reach out to the corresponding people and try to get a timeframe for those.

@fibonacci1729
Contributor

Hey @tfenster! We will kick off a patch release of Spin on Monday after the AM public Spin meeting, which should be available by EOD. I imagine a release of containerd-shim-spin should follow a similar timeline the next day. We'll try to get you situated by midweek next week!

@tfenster
Author

Great, thanks!

@radu-matei
Member

Mentioning that we also have canary builds of Spin, so you could at least try things out locally by grabbing a canary build from https://github.com/fermyon/spin/releases.

@tfenster
Author

Thanks for the hint @radu-matei! I tried it, and just to confirm your fix, @fibonacci1729: it works locally now for > 1 h with the canary build. The latest release would definitely have broken in that time, so it looks like it is fixed 🎉

@radu-matei
Member

That’s great to hear, thanks for the confirmation, @tfenster!

Enjoy your weekend!

fibonacci1729 added a commit that referenced this issue Jan 13, 2025
Signed-off-by: Brian H <[email protected]>
(cherry picked from commit ec11ba2)
fibonacci1729 added a commit that referenced this issue Jan 13, 2025
[Backport v3.1] fix #2974 -- use redis::aio::ConnectionManager
@ThorstenHans
Contributor

@tfenster quick update: @kate-goldenring created a corresponding PR over on containerd-shim-spin to land those changes there soon -> spinkube/containerd-shim-spin#263

@kate-goldenring
Contributor

Here is an updated node installer image you can try out. It has the Spin patch in it:

# Remove the annotation from all nodes so we can reapply it later to trigger a new install of the shim
kubectl annotate node --all kwasm.sh/kwasm-node=-
# Upgrade node installer to latest shim
helm upgrade --install \
  kwasm-operator kwasm/kwasm-operator \
  --namespace kwasm \
  --create-namespace \
  --set "kwasmOperator.installerImage=ghcr.io/spinkube/containerd-shim-spin/node-installer:20250114-015325-gf938c61"
# Reapply annotation to provision all nodes
kubectl annotate node --all kwasm.sh/kwasm-node=true

You should see a new kwasm job start and complete for each node.

@tfenster
Author

@kate-goldenring Thanks, I'll try it and get back to you. Do I also need to rebuild my application with the new Spin version?

@kate-goldenring
Contributor

@tfenster you can use your existing app -- no need to rebuild. It will use the newly updated host components of the Spin runtime.

@tfenster
Author

tfenster commented Jan 14, 2025

Looks good! It works now for ~1 h; the previous version never made it that long. Thanks for the quick solution and turnaround to everyone involved! @itowlson @ThorstenHans @fibonacci1729 @radu-matei @kate-goldenring

@kate-goldenring
Contributor

Great to hear it is working! We also released the shim with the fix, so you can update to shim v0.18.0 as well:

# Remove the annotation from all nodes so we can reapply it later to trigger a new install of the shim
kubectl annotate node --all kwasm.sh/kwasm-node=-
# Upgrade node installer to latest shim
helm upgrade --install \
  kwasm-operator kwasm/kwasm-operator \
  --namespace kwasm \
  --create-namespace \
  --set "kwasmOperator.installerImage=ghcr.io/spinkube/containerd-shim-spin/node-installer:v0.18.0"
# Reapply annotation to provision all nodes
kubectl annotate node --all kwasm.sh/kwasm-node=true

@tfenster
Author

Did that as well, thanks. A little nit to add in case anyone comes across this: the kubectl annotate command also needed --overwrite for me.
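
For example (a sketch; whichever annotate call hits an existing annotation value needs the flag):

kubectl annotate node --all --overwrite kwasm.sh/kwasm-node=true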
