Loki-Ruler panic runtime error #15816

Open
slitsevych opened this issue Jan 17, 2025 · 0 comments
Labels
component/ruler type/bug Something is not working as expected

slitsevych commented Jan 17, 2025

Describe the bug

The loki-ruler pod periodically crashes with a panic. The ruler runs as a single replica with persistent storage and evaluates 29 alert rules split across 6 groups. It runs fine most of the time, then crashes without warning with the following error:

panic: runtime error: slice bounds out of range [-3:]

goroutine 3046223 [running]:
github.com/grafana/loki/v3/pkg/logproto.(*SampleQueryRequest).MarshalToSizedBuffer(0xc00b5a5100, {0xc00d389c00, 0x3ad, 0x3ad})
        /src/loki/pkg/logproto/logproto.pb.go:6863 +0x5d8
github.com/grafana/loki/v3/pkg/logproto.(*SampleQueryRequest).Marshal(0xc00b5a5100)
        /src/loki/pkg/logproto/logproto.pb.go:6781 +0x4d
google.golang.org/protobuf/internal/impl.legacyMarshal({{}, {0x43b2760, 0xc00acffc00}, {0x0, 0x0, 0x0}, 0x0})
        /src/loki/vendor/google.golang.org/protobuf/internal/impl/legacy_message.go:411 +0xf2
google.golang.org/protobuf/proto.MarshalOptions.size({{}, 0x0?, 0x51?, 0x5a?}, {0x43b2760, 0xc00acffc00})
        /src/loki/vendor/google.golang.org/protobuf/proto/size.go:44 +0x103
google.golang.org/protobuf/proto.MarshalOptions.Size({{}, 0xe0?, 0xdb?, 0xb1?}, {0x4348300?, 0xc00acffc00?})
        /src/loki/vendor/google.golang.org/protobuf/proto/size.go:26 +0x4c
google.golang.org/protobuf/proto.Size(...)
        /src/loki/vendor/google.golang.org/protobuf/proto/size.go:16
google.golang.org/grpc/encoding/proto.(*codecV2).Marshal(0xc00ddbaa48?, {0x3b1dbe0?, 0xc00b5a5100?})
        /src/loki/vendor/google.golang.org/grpc/encoding/proto/proto.go:49 +0x7a
google.golang.org/grpc.encode({0x7bb1a4fb4b98?, 0x65e4b20?}, {0x3b1dbe0?, 0xc00b5a5100?})
        /src/loki/vendor/google.golang.org/grpc/rpc_util.go:694 +0x4a
google.golang.org/grpc.prepareMsg({0x3b1dbe0?, 0xc00b5a5100?}, {0x7bb1a4fb4b98?, 0x65e4b20?}, {0x0, 0x0}, {0x437d800, 0xc0001cdd10}, {0x436df40, 0xc001029680})
        /src/loki/vendor/google.golang.org/grpc/stream.go:1830 +0xe5
google.golang.org/grpc.(*clientStream).SendMsg(0xc010a08fc0, {0x3b1dbe0, 0xc00b5a5100})
        /src/loki/vendor/google.golang.org/grpc/stream.go:906 +0xf1
github.com/grafana/dskit/middleware.(*instrumentedClientStream).SendMsg(0xc0013d0bd0, {0x3b1dbe0?, 0xc00b5a5100?})
        /src/loki/vendor/github.com/grafana/dskit/middleware/grpc_instrumentation.go:141 +0x32
github.com/grpc-ecosystem/grpc-opentracing/go/otgrpc.(*openTracingClientStream).SendMsg(0xc011754440, {0x3b1dbe0?, 0xc00b5a5100?})
        /src/loki/vendor/github.com/grpc-ecosystem/grpc-opentracing/go/otgrpc/client.go:195 +0x2d
github.com/grafana/loki/v3/pkg/logproto.(*querierClient).QuerySample(0x100c000103808?, {0x4383908?, 0xc0013d08c0?}, 0xc00b5a5100, {0x0?, 0x3301320?, 0x4333030?})
        /src/loki/pkg/logproto/logproto.pb.go:5957 +0xd0
github.com/grafana/loki/v3/pkg/querier.(*IngesterQuerier).SelectSample.func1({0xc0101aa900?, 0x10?}, {0x7bb1a4fb4af8, 0xc00163ae10})
        /src/loki/pkg/querier/ingester_querier.go:173 +0x9c
github.com/grafana/loki/v3/pkg/querier.(*IngesterQuerier).forGivenIngesters.func1({0x4383898, 0xc00ab51cc0}, 0xc01086ca80)
        /src/loki/pkg/querier/ingester_querier.go:134 +0x12a
github.com/grafana/dskit/ring.DoUntilQuorum[...].func1(0xc00ab51cc0?, 0xc01086ca80?)
        /src/loki/vendor/github.com/grafana/dskit/ring/replication_set.go:219 +0x22
github.com/grafana/dskit/ring.DoUntilQuorumWithoutSuccessfulContextCancellation[...].func2()
        /src/loki/vendor/github.com/grafana/dskit/ring/replication_set.go:298 +0xef
created by github.com/grafana/dskit/ring.DoUntilQuorumWithoutSuccessfulContextCancellation[...] in goroutine 803
        /src/loki/vendor/github.com/grafana/dskit/ring/replication_set.go:287 +0x638
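
For context, the same panic message can be reproduced in isolation. The trace shows the message being sized (proto.Size) and then written by MarshalToSizedBuffer, which in gogo/protobuf-style generated code fills the buffer from the end toward the start. If a field grows between those two steps, for example because another goroutine modifies the same request, the write index drops below zero. The sketch below illustrates that pattern only. It is a hypothesis about the failure mode, not a confirmed diagnosis of this issue, and `toyRequest` and `marshalWithMutation` are hypothetical names that do not exist in Loki.

```go
package main

import "fmt"

// toyRequest is a hypothetical stand-in for a protobuf message.
// It is NOT Loki's SampleQueryRequest.
type toyRequest struct {
	Selector string
}

// Size reports how many bytes marshalToSizedBuffer will need.
func (m *toyRequest) Size() int { return len(m.Selector) }

// marshalToSizedBuffer fills buf from the end toward the start and
// returns the number of bytes written.
func (m *toyRequest) marshalToSizedBuffer(buf []byte) int {
	i := len(buf)
	i -= len(m.Selector) // goes negative if Selector grew after Size()
	copy(buf[i:], m.Selector)
	return len(buf) - i
}

// marshalWithMutation sizes the buffer, applies mutate, then marshals.
// It returns the recovered panic message, or "" if marshalling succeeded.
func marshalWithMutation(m *toyRequest, mutate func(*toyRequest)) (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	buf := make([]byte, m.Size())
	mutate(m) // stands in for a concurrent writer in another goroutine
	m.marshalToSizedBuffer(buf)
	return ""
}

func main() {
	m := &toyRequest{Selector: `{app="foo"}`}
	// Grow the field by 3 bytes after the buffer has been sized.
	fmt.Println(marshalWithMutation(m, func(r *toyRequest) { r.Selector += "abc" }))
}
```

Growing the field by three bytes after sizing yields a `[-3:]` slice bound, the same shape as the panic above. If this pattern applies here, it would point at the request being shared and modified across the per-ingester goroutines visible in the trace, but that remains an assumption until confirmed against the Loki source.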

Environment:

  • Infrastructure: Kubernetes, GKE
  • Deployment tool: Helm (Microservices/Distributed mode)
  • Loki Chart version: 6.24.0
  • App version: 3.3.2

Ruler configuration
Ruler spec in Helm:

ruler:
  enabled: true
  replicas: 1
  resources:
    requests:
      cpu: 50m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 4096Mi
  extraEnv:
  - name: GOMEMLIMIT
    value: "3750MiB"
  terminationGracePeriodSeconds: 30
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node_pool
                operator: In
                values:
                  - "${node_pool}"
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/component: disabled
        topologyKey: kubernetes.io/hostname
  nodeSelector:
    cloud.google.com/gke-nodepool: "${node_pool}"
  tolerations:
  - key: "app"
    operator: "Equal"
    value: "${node_taint}"
    effect: "NoSchedule"
  persistence:
    enabled: true
    size: 4Gi
    storageClass: hyperdisk-storage
  appProtocol:
    grpc: ""
  directories: {}
  extraVolumeMounts:
    - name: sc-rules-volume
      mountPath: "/var/loki/rules/fake"
  extraVolumes:
    - name: sc-rules-volume
      emptyDir: {}
  extraContainers:
  - name: loki-sc-rules
    image: kiwigrid/k8s-sidecar:1.28.0
    imagePullPolicy: Always
    env:
      - name: METHOD
        value: "WATCH"
      - name: NAMESPACE
        value: "ALL"
      - name: LABEL
        value: "loki_rule"
      - name: LABEL_VALUE
        value: "true"
      - name: FOLDER
        value: "/var/loki/rules/fake"
      - name: WATCH_SERVER_TIMEOUT
        value: "60"
      - name: WATCH_CLIENT_TIMEOUT
        value: "60"
      - name: LOG_LEVEL
        value: "INFO"
      - name: RESOURCE
        value: "configmap"
      - name: UNIQUE_FILENAMES
        value: "false"
      - name: SKIP_TLS_VERIFY
        value: "true"
      - name: PYTHONWARNINGS
        value: "ignore:Unverified HTTPS request"
    volumeMounts:
      - name: sc-rules-volume
        mountPath: "/var/loki/rules/fake"

Ruler Config:

  rulerConfig:
    wal:
      dir: /var/loki/ruler-wal
      truncate_frequency: 30m
    wal_cleaner:
      min_age: 12h
      period: 1h
    storage:
      type: local
      local:
        directory: /var/loki/rules
    rule_path: /var/loki/rules/fake
    remote_write:
      enabled: true
      client:
        url: http://prometheus.${private_domain}/api/v1/write
    alertmanager_url: "http://alertmanager.${private_domain}"
    enable_alertmanager_v2: true
    evaluation:
      mode: local
    enable_api: true
    poll_interval: 5m
    notification_queue_capacity: 50000
    notification_timeout: 60s
    for_outage_tolerance: 1h
    search_pending_for: 10m
    flush_period: 5m
    external_url: "https://grafana.${public_domain}"
    external_labels:
      cluster: ${cluster}
      category: logs
    ring:
      kvstore:
        store: memberlist

JStickler added the component/ruler and type/bug labels on Jan 21, 2025