Broken symlink garbage collection issue in /run/crio/ #1497
-
Hello @andrew-wilson-88, is it possible to get a must-gather from the cluster and the journal of a node affected by this problem? See https://docs.okd.io/latest/support/gathering-cluster-data.html Also, could you send some partial output of these commands executed on one of the affected nodes?
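For reference, a sketch of how that data is typically collected. This assumes you run it from a workstation with cluster-admin access; `NODE` is a placeholder for the affected node's name, and the guard makes the snippet a no-op where the `oc` CLI is absent:

```shell
NODE=${NODE:-node1}   # placeholder: name of an affected node

if command -v oc >/dev/null 2>&1; then
  # Collect a must-gather archive into the current directory
  oc adm must-gather
  # Pull the CRI-O journal from the affected node
  oc adm node-logs "$NODE" -u crio > "crio-${NODE}.journal"
  result="collected"
else
  result="oc CLI not found; run this where you have cluster access"
fi
echo "$result"
```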
-
Hi @aleskandro, Of course, please find the must-gather and an export of the journal at the links below: Journal: The output of the commands listed is as follows; I've also added an excerpt of the broken symlinks and their targets for reference:
Please let me know if there is anything else I can provide regarding this matter. Regards,
-
Hello everyone, I'm experiencing the exact same issue on OKD 4.7.0-0.okd-2021-07-03-190901. I'm seeing a large number of inodes being used under /run/crio and /run/crio/exits.
-
Hi all, I apologize for the delay in responding, and thanks for the provided info. This issue seems reproducible in the 4.13 nightlies as well, with the steps below: when pods are deleted, broken symlinks are left behind under /run/crio. Environment tested with the following versions of crictl, conmon and runc:
Steps to reproduce
```shell
NODE=node1
oc adm taint nodes $NODE conmon-bug=value:NoSchedule
oc debug node/$NODE
chroot /host
watch 'find /run/crio -type l ! -readable | wc -l'
oc new-project my-project
oc create -f - <<EOF
kind: Pod
apiVersion: v1
metadata:
  generateName: example-
  labels:
    app: hello
spec:
  nodeSelector:
    kubernetes.io/hostname: $NODE
  tolerations:
  - key: "conmon-bug"
    operator: "Exists"
    effect: "NoSchedule"
  containers:
  - name: hello
    securityContext:
      allowPrivilegeEscalation: false
      runAsNonRoot: true
      seccompProfile:
        type: RuntimeDefault
      capabilities:
        drop:
        - ALL
    image: image-registry.openshift-image-registry.svc:5000/openshift/cli
    imagePullPolicy: IfNotPresent
    command:
    - /bin/sh
    args:
    - '-c'
    - date; echo Hello from the Kubernetes cluster
  restartPolicy: Never
EOF
```
Instead of waiting at point 3, you can just delete the container directly. The broken symlink is created by conmon.
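The counting command used in the reproduction steps above can be sanity-checked locally. This is a minimal sketch in a temporary directory (stand-in paths, not the real /run/crio): a dangling symlink fails `access()`, so `-type l ! -readable` matches it while a healthy symlink is skipped:

```shell
d=$(mktemp -d)
touch "$d/target"
ln -s "$d/target" "$d/ok"        # valid symlink: readable through its target
ln -s "$d/missing" "$d/broken"   # dangling symlink: target never created
# ! -readable catches the dangling link; the healthy one is excluded
find "$d" -type l ! -readable | wc -l   # prints 1
```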
-
Proposed fix: containers/conmon#384
-
Hello, I've tried to follow the merge trail but couldn't find whether this has made it into a new release yet. Is there a particular version that the patch was introduced in? Regards,
-
Describe the bug
Every few weeks we see seemingly broken symlinks to overlay volumes in the /run/crio/ directory. Eventually this depletes the inodes on the partition, so newly scheduled pods are unable to start.
We do have multiple CronJobs that connect to EBS PVs running every minute, so it could be an issue in garbage collection, as the symlinks seem to point to volumes which no longer exist.
Has anybody seen anything similar, or is able to provide any further information on how these symlinks tie into everything? Some seem to remove themselves, while others can stay in place for months, for pods which haven't been running for some time.
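Dangling symlinks like these can be located (and, once confirmed harmless, removed) with GNU find's `-xtype l` predicate, which matches symlinks whose resolved target no longer exists. A minimal sketch in a temporary directory — the paths are stand-ins, not the real /run/crio; on a live node you would inspect the list carefully before adding `-delete`:

```shell
d=$(mktemp -d)
touch "$d/target"
ln -s "$d/target" "$d/ok"       # healthy symlink
ln -s "$d/gone" "$d/dangling"   # target never created: dangling
# -xtype l matches symlinks whose resolved target does not exist
find "$d" -xtype l -print
# Once the list is confirmed, the same predicate can remove them
find "$d" -xtype l -delete
```

After the `-delete` pass only the healthy symlink remains, so inode usage from stale links is reclaimed.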
Version
We've seen this behaviour in versions 4.8 through 4.11 on AWS, installed using the OKD installer.
How reproducible
This has occurred on multiple nodes for over a year.