CASMTRIAGE-7308: Add Troubleshooting Steps #5430
Merged
Istio Pods Crashing post upgrade due to fs.inotify Limits
Description
CASMTRIAGE-7308
After the Istio upgrade, the nodes have not yet been rebooted into the new image in which the fs.inotify.max_user_instances and fs.inotify.max_user_watches limits are increased. As a result, restarted pods may try to watch more files or create more inotify instances than the kernel currently allows. Pods that cannot watch the files or directories they need can crash or fail, which may disrupt service mesh operations such as traffic management, logging, or configuration changes.
A power outage or node reboot mid-upgrade would hit the same problem, because the pods would restart before the required kernel parameters have been updated on the nodes.
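As a rough illustration of how the condition described above might be detected on a node, the sketch below reads the two inotify limits from `/proc/sys/fs/inotify` (the standard Linux location for these sysctls) and reports any that fall below a minimum. The minimum values and the function names are hypothetical placeholders chosen for this example; the values shipped in the new node image are not stated here and should be substituted.

```python
from pathlib import Path

# ASSUMPTION: placeholder minimums for illustration only. Replace them with
# the values actually set in the new node image.
ASSUMED_MINIMUMS = {
    "max_user_instances": 1024,
    "max_user_watches": 1048576,
}


def read_inotify_limits(proc_dir="/proc/sys/fs/inotify"):
    """Return the current inotify limits found under proc_dir as integers."""
    limits = {}
    for name in ASSUMED_MINIMUMS:
        limits[name] = int((Path(proc_dir) / name).read_text().strip())
    return limits


def find_low_limits(limits, minimums=None):
    """Return {name: (current, minimum)} for every limit below its minimum."""
    minimums = ASSUMED_MINIMUMS if minimums is None else minimums
    return {
        name: (limits[name], minimum)
        for name, minimum in minimums.items()
        if limits[name] < minimum
    }


if __name__ == "__main__":
    low = find_low_limits(read_inotify_limits())
    if low:
        for name, (current, minimum) in sorted(low.items()):
            print(f"fs.inotify.{name} is {current}, below assumed minimum {minimum}")
    else:
        print("fs.inotify limits meet the assumed minimums")
```

Running this on a node that has not yet been rebooted into the new image would be one way to confirm whether the older, lower limits are still in effect before restarting Istio pods.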
Relates to:
Checklist
.github/CODEOWNERS with the corresponding team in Cray-HPE.