
io error: broken pipe when accessing redis-backed KV store #2974

Closed
tfenster opened this issue Jan 5, 2025 · 30 comments · Fixed by #2976

@tfenster

tfenster commented Jan 5, 2025

I have an application (tfenster/verified-bluesky) that worked fine on your cloud, but due to the 1,024-entry limit of the KV store, I moved it to Azure (Azure Kubernetes Service, Azure Cache for Redis). Since then it works for some time (I didn't measure exactly, but it seems like a few minutes), and then calls to the KV store fail with "io error: broken pipe". I have narrowed it down to the following piece of code as the easiest repro, but other places accessing the store also fail with the same error:

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strings"

	spinhttp "github.com/fermyon/spin/sdk/go/v2/http"
	"github.com/fermyon/spin/sdk/go/v2/kv"
)

func init() {
	spinhttp.Handle(func(w http.ResponseWriter, r *http.Request) {
		switch r.Method {
		case http.MethodGet:
			fmt.Println("Getting stats")
			store, err := kv.OpenStore("default")
			if err != nil {
				http.Error(w, "Error opening kv store "+err.Error(), http.StatusInternalServerError)
				return
			}
			defer store.Close()

			fmt.Println("Getting keys")
			keys, err := store.GetKeys()
			if err != nil {
				http.Error(w, "Error getting keys from kv store "+err.Error(), http.StatusInternalServerError)
				return
			}
			// ... (rest of the handler elided; see main.go)
		}
	})
}

func main() {}

The call to kv.OpenStore still seems to work, but store.GetKeys() fails and I get this response:

HTTP/1.1 500 Internal Server Error
Date: Sun, 05 Jan 2025 08:23:35 GMT
Content-Type: text/plain; charset=utf-8
Content-Length: 55
Connection: close
x-content-type-options: nosniff
Strict-Transport-Security: max-age=31536000; includeSubDomains

Error getting keys from kv store io error: broken pipe

For full context, see main.go. The logs only show this:

Getting stats
Getting keys

The setup of the Redis backend and the creation of the runtime config look like this:

az redis create -n $redisName -g $rgAndClusterName -l $location --sku Basic --vm-size c0 --redis-configuration '{ "maxmemory-policy": "noeviction" }'
$redisKeys = $(az redis list-keys -n $redisName -g $rgAndClusterName) | ConvertFrom-Json

@"
[key_value_store.default]
type = "redis"
url = "rediss://:$($redisKeys.primaryKey)@$redisName.redis.cache.windows.net:6380"
"@ | Out-File ./runtime-config-redis.toml

I then put that into this secret (content redacted, obviously):

apiVersion: v1
kind: Secret
metadata:
  name: verified-bluesky-runtime-config
type: Opaque
data:
  runtime-config.toml: ...
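
One way to produce such a secret from the file above (a sketch using kubectl; the secret name matches the metadata above):

kubectl create secret generic verified-bluesky-runtime-config \
  --from-file=runtime-config.toml=./runtime-config-redis.toml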

Deploying the Spin app looks like this: spinapp.yaml; the latest build to create and push the OCI artifact is here.

Let me know if I can share anything else.

Not sure if that is relevant, but this is the output of the commands you asked for in the template, taken from my dev environment:

  • Spin version (spin --version):
vscode ➜ /workspaces/verified-bluesky/deployment (main) $ spin --version
spin 3.1.1 (aa919ce 2024-12-20)
  • Installed plugin versions (spin plugins list --installed)
vscode ➜ /workspaces/verified-bluesky/deployment (main) $ spin plugins list --installed
cloud 0.10.0 [installed]
kube 0.3.1 [installed]
@tfenster
Author

tfenster commented Jan 5, 2025

As additional context, the Redis side shows no issues, at least not in the Azure dashboard:

[screenshot: Azure dashboard metrics for the Redis instance, showing no errors]

@tfenster
Author

tfenster commented Jan 5, 2025

I have now tried to

  • add the redis URL to allowed_outbound_hosts (rough sketch below)
  • go from 2 to 1 replica
  • remove no-debug from the build command

Still the same error and no additional output in the logs. Removing the pod and letting it get recreated solves it until it happens again.
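
For reference, the allowed_outbound_hosts entry in spin.toml looked roughly like this (a sketch; the component name is a placeholder):

[component.verified-bluesky]
allowed_outbound_hosts = ["rediss://*.redis.cache.windows.net:6380"]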

@tfenster
Author

tfenster commented Jan 5, 2025

Adding a panic(err) call also didn't bring any relevant information, I think:

2025-01-05T16:41:59.156138352Z panic: io error: broken pipe
2025-01-05T16:41:59.156286164Z ERROR spin_trigger_http::server: Error processing request: error while executing at wasm backtrace:
2025-01-05T16:41:59.156295489Z     0: 0x3891 - main!runtime._panic.llvm.2513346125720970973
2025-01-05T16:41:59.156298183Z     1: 0x47483 - main!spin_http_handle_http_request
2025-01-05T16:41:59.156301403Z     2: 0x48223 - main!__wasm_export_spin_http_handle_http_request
2025-01-05T16:41:59.156307315Z     3: 0x13cea3 - wit-component:adapter:wasi_snapshot_preview1!fermyon:spin/inbound-http#handle-request
2025-01-05T16:41:59.156311020Z note: using the `WASMTIME_BACKTRACE_DETAILS=1` environment variable may show more debugging information
2025-01-05T16:41:59.156322874Z
2025-01-05T16:41:59.156327104Z Caused by:
2025-01-05T16:41:59.156329370Z     wasm trap: wasm `unreachable` instruction executed

@itowlson
Contributor

itowlson commented Jan 5, 2025

cc @ThorstenHans for Azure wisdom and @fibonacci1729 for Redis KV knowledge

@ThorstenHans
Contributor

@tfenster Thanks for the detailed error description and all the information for reproducing this one.

I did the following:

  • created a new AKS cluster and a new Azure Cache for Redis using the settings provided
  • deployed latest SpinKube
  • took your Go code to build and deploy a Spin App

I see all requests resulting in an HTTP 200 (I simply return the list of keys as the response). I also used hey to put some load on the app; when sending 100k requests, all calls succeed.
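
The load test was roughly this (the URL is a placeholder for the app endpoint):

# 100k requests with 50 concurrent workers
hey -n 100000 -c 50 https://<app-endpoint>/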

With runtimes older than Spin 2.8.0, we had a similar issue in the context of SQLite (see fermyon/enterprise-architectures-and-patterns#18).

IIRC, containerd-shim-spin was updated to the 3.0.0 runtime by the end of November. The node-installer image is responsible for deploying the runtime to your Kubernetes nodes...

That said, could you check which node-installer image is used for deploying containerd-shim-spin to your Kubernetes nodes?

You can find the version by executing this command:

KWASM_NS=kwasm
KWASM_RELEASE_NAME=kwasm-operator

helm get values -n $KWASM_NS $KWASM_RELEASE_NAME

My setup works with ghcr.io/spinkube/containerd-shim-spin/node-installer:v0.17.0

@tfenster
Author

tfenster commented Jan 6, 2025

@ThorstenHans can you interact with it, then let it run for a bit and then try again? It works for me initially as well, but after some time (maybe 15 minutes, not sure), it breaks. I would also be happy to create an instance for you and give you full access, if that helps

The node-installer image is the same as for you; I created the cluster only two days ago:

helm get values -n kwasm kwasm-operator                    
USER-SUPPLIED VALUES:                                                          
kwasmOperator:
  installerImage: ghcr.io/spinkube/containerd-shim-spin/node-installer:v0.17.0

One other thought: which Go and TinyGo versions are you using? Might that be a cause for issues?

@ThorstenHans
Contributor

That's the trick. After waiting some time, I see the same error. I'll try to spot how/where this happens.

@tfenster
Author

tfenster commented Jan 6, 2025

Great, thanks

@ThorstenHans
Contributor

@tfenster you could use CosmosDB as the KV store. I tested it with the constraints mentioned above, and it works (even if the app is idle for quite some time).

With a CosmosDB instance deployed, you create a database and a container using your desired configuration; the only requirement is that the partition key is set to /id.

The runtime configuration file should look similar to this:

[key_value_store.default]
type = "azure_cosmos"
key = "your_key_here=="
account = "your_account_name"
database = "your_db_name"
container = "your_container_name"
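
Provisioning that on the Azure side could look roughly like this (a sketch with the az CLI; all names are placeholders):

az cosmosdb create -n $accountName -g $rgName
az cosmosdb sql database create -a $accountName -g $rgName -n $dbName
az cosmosdb sql container create -a $accountName -g $rgName -d $dbName \
  -n $containerName --partition-key-path "/id"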

@fibonacci1729 this is a good indicator that there is something broken in the factor for redis-kv

@tfenster
Author

tfenster commented Jan 6, 2025

@ThorstenHans that's what I first tried, but I ran into the RU limits very quickly for the cheap instances, so I would prefer Redis.

@fibonacci1729
Contributor

At a glance, this seems like a quirk of using redis::aio::MultiplexedConnection in the kv redis factor, possibly in conjunction with the Redis instance's default timeout configuration. I'll poke a little deeper to understand what's going on here.

@tfenster
Author

tfenster commented Jan 6, 2025

Not sure if and where I can configure it, but what would be a timeout that could fix this?

@fibonacci1729
Contributor

fibonacci1729 commented Jan 6, 2025

Actually, after a slightly longer glance, I think the issue is that redis::aio::MultiplexedConnection doesn't reconnect after idling/disconnect (AFAICT from the docs). The suggested way to achieve reconnection is to use the ConnectionManager configured with retry/backoff enabled.

Re: a reasonable timeout on the server, I'm not entirely sure. I think any reasonable timeout would just delay the behavior you're currently seeing.
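
For illustration, a minimal sketch of that approach against the redis crate (not the actual Spin patch; assumes the crate's connection-manager feature and a tokio runtime):

use redis::aio::ConnectionManager;

// Unlike a bare MultiplexedConnection, ConnectionManager transparently
// reconnects (with retry/backoff) when the server drops an idle connection.
async fn open(url: &str) -> redis::RedisResult<ConnectionManager> {
    let client = redis::Client::open(url)?;
    ConnectionManager::new(client).await
}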

@tfenster
Author

tfenster commented Jan 6, 2025

OK, then I'll patiently wait for a fix 😊 Thanks a lot for the quick check!

@tfenster
Author

tfenster commented Jan 9, 2025

@fibonacci1729 Would you already have an idea if and when a fix for this issue could become available?

@fibonacci1729
Contributor

Hey @tfenster! Just context switching to this now. I'll have a better idea by EOD once I can reproduce and poke around a bit to confirm the fix. I'll update here.

fibonacci1729 added a commit to fibonacci1729/spin that referenced this issue Jan 9, 2025
fibonacci1729 added a commit that referenced this issue Jan 10, 2025
fix #2974 -- use redis::aio::ConnectionManager
@tfenster
Author

Thanks a lot @fibonacci1729 for working on it so quickly! Do I now need to wait for a release, or can I already get the fix somehow?

@ThorstenHans
Contributor

First, a new Spin release must be cut; then containerd-shim-spin must update the corresponding dependency and cut a release of its own. I'll reach out to the corresponding people and try to get a timeframe for those.

@fibonacci1729
Contributor

Hey @tfenster! We will kick off a patch release of Spin on Monday after the AM public Spin meeting, which should be available by EOD. I imagine a release of containerd-shim-spin should follow a similar timeline the next day. We'll try to get you situated by midweek next week!

@tfenster
Author

Great, thanks!

@radu-matei
Member

Mentioning that we also have canary builds of Spin, so you could at least try things out locally by grabbing a canary build from https://github.com/fermyon/spin/releases.

@tfenster
Author

Thanks for the hint @radu-matei! I tried it, and just to confirm your fix, @fibonacci1729: it works locally now for > 1 h with the canary build. The latest release would definitely have broken in that time, so it looks like it is fixed 🎉

@radu-matei
Member

That’s great to hear, thanks for the confirmation, @tfenster!

Enjoy your weekend!

fibonacci1729 added a commit that referenced this issue Jan 13, 2025
Signed-off-by: Brian H <[email protected]>
(cherry picked from commit ec11ba2)
fibonacci1729 added a commit that referenced this issue Jan 13, 2025
[Backport v3.1] fix #2974 -- use redis::aio::ConnectionManager
@ThorstenHans
Contributor

@tfenster quick update: @kate-goldenring created a corresponding PR over on containerd-shim-spin to land those changes there soon -> spinkube/containerd-shim-spin#263

@kate-goldenring
Contributor

Here is an updated node installer image you can try out. It has the Spin patch in it:

# Remove the annotation from all nodes so we can reapply it later to trigger a new install of the shim
kubectl annotate node --all kwasm.sh/kwasm-node=-
# Upgrade node installer to latest shim
helm upgrade --install \
  kwasm-operator kwasm/kwasm-operator \
  --namespace kwasm \
  --create-namespace \
  --set "kwasmOperator.installerImage=ghcr.io/spinkube/containerd-shim-spin/node-installer:20250114-015325-gf938c61"
# Reapply annotation to provision all nodes
kubectl annotate node --all kwasm.sh/kwasm-node=true

You should see a new kwasm job start and complete for each node.

@tfenster
Author

@kate-goldenring Thanks, I'll try it and get back to you. Do I also need to rebuild my application with the new Spin version?

@kate-goldenring
Contributor

@tfenster you can use your existing app -- no need to rebuild. It will use the newly updated host components of the Spin runtime.

@tfenster
Author

tfenster commented Jan 14, 2025

Looks good! It works now for ~1 h; the previous version never made it that long. Thanks for the quick solution and turnaround to everyone involved! @itowlson @ThorstenHans @fibonacci1729 @radu-matei @kate-goldenring

@kate-goldenring
Contributor

Great to hear it is working! We also released the shim with the fix, so you can update to shim v0.18.0 as well:

# Remove the annotation from all nodes so we can reapply it later to trigger a new install of the shim
kubectl annotate node --all kwasm.sh/kwasm-node=-
# Upgrade node installer to latest shim
helm upgrade --install \
  kwasm-operator kwasm/kwasm-operator \
  --namespace kwasm \
  --create-namespace \
  --set "kwasmOperator.installerImage=ghcr.io/spinkube/containerd-shim-spin/node-installer:v0.18.0"
# Reapply annotation to provision all nodes
kubectl annotate node --all kwasm.sh/kwasm-node=true

@tfenster
Author

Did that as well, thanks. A little nit to add in case anyone comes across this: the kubectl annotate command also needed --overwrite for me.
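
For example (a sketch; whichever annotate call hits an existing annotation value needs the flag):

kubectl annotate node --all --overwrite kwasm.sh/kwasm-node=true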
