Commit 414a211

Add Troubleshooting Steps

* Istio Pods Crashing post upgrade due to `fs.inotify` Limits

arka-pramanik-hpe committed Oct 4, 2024
1 parent d643e2e, commit 414a211
Showing 1 changed file with 42 additions and 0 deletions: README.md

@@ -54,6 +54,48 @@ manifestgen -c customizations.yaml -i sysmgmt.yaml -o sysmgmt-cust.yaml
```sh
loftsman ship --charts-path <tgz-charts-path> --manifest-path sysmgmt-cust.yaml
```

---

## Troubleshooting: Istio Pods Crashing After Upgrade Due to `fs.inotify` Limits

### Issue Description

After an Istio upgrade, nodes may not yet have booted into the new image that raises the `fs.inotify.max_user_instances` and `fs.inotify.max_user_watches` kernel parameters. When pods restart (for example after a power outage or node reboot), they can fail because the inotify limits still in effect on those nodes are too low.

This issue manifests when:
- Pods are unable to create enough inotify instances to monitor required files.
- The system hits the maximum number of file watches, causing crashes or failures in services dependent on file system event monitoring.

This problem can be triggered by events like:
- A node dying and rebooting mid-upgrade.
- Power outages where pods restart on nodes with old kernel settings.
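
To confirm whether a node is still running with the old limits, the current values can be read with `sysctl` (a minimal check, assuming the same `pdsh` node targets used in the workaround below):

```bash
# Read the current inotify limits on the management and worker nodes.
pdsh -w ncn-m00[1-3],ncn-w00[1-5] 'sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches'
```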

### Related Issue

- [Istio Issue #35829](https://github.com/istio/istio/issues/35829)

The problem was worked around by increasing `fs.inotify.max_user_instances` from the default of 128 to 1024, as suggested in the GitHub thread.

### Steps to Reproduce

1. Upgrade Istio without rebooting the nodes into an image that has the updated kernel parameter values.
2. Trigger a pod restart or node reboot.
3. Observe pods failing to start due to inotify-related errors (see the check below).
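
A quick way to spot the failing pods, assuming `kubectl` access and that the affected workloads are in the `istio-system` namespace (a hedged sketch; the namespace and pod names may differ on a given system):

```bash
# List Istio pods that are not in the Running state after the restart.
kubectl get pods -n istio-system | grep -v Running

# Inspect events and container status for a failing pod (<pod-name> is a placeholder).
kubectl describe pod <pod-name> -n istio-system
```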

### Workaround

Manually increase the `fs.inotify.max_user_instances` and `fs.inotify.max_user_watches` values to provide sufficient resources for Istio and other Kubernetes components:

```bash
# Raise the maximum number of inotify instances per user (fs.inotify and user-namespace sysctls).
pdsh -w ncn-m00[1-3],ncn-w00[1-5] 'sysctl -w fs.inotify.max_user_instances=1024'
pdsh -w ncn-m00[1-3],ncn-w00[1-5] 'sysctl -w user.max_inotify_instances=1024'
# Raise the maximum number of inotify watches per user.
pdsh -w ncn-m00[1-3],ncn-w00[1-5] 'sysctl -w fs.inotify.max_user_watches=1048576'
pdsh -w ncn-m00[1-3],ncn-w00[1-5] 'sysctl -w user.max_inotify_watches=1048576'
```

This raises the limits from their default values, allowing more inotify instances and file watches. Note that `sysctl -w` only changes the running kernel: the values revert when a node reboots, unless the node boots into an image that already sets the higher limits.
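
If a persistent setting is needed before the nodes can be rebooted into an updated image, one option (not part of the original commit; a sketch assuming the standard `sysctl.d` drop-in mechanism is available on the NCNs) is to write the values to a drop-in file and reload:

```bash
# Persist the increased inotify limits across reboots via a sysctl drop-in file (hypothetical filename).
pdsh -w ncn-m00[1-3],ncn-w00[1-5] 'printf "fs.inotify.max_user_instances=1024\nfs.inotify.max_user_watches=1048576\n" > /etc/sysctl.d/99-inotify.conf'

# Reload sysctl settings from all configuration files.
pdsh -w ncn-m00[1-3],ncn-w00[1-5] 'sysctl --system'
```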

---

## Contributing
See the <a href="https://github.com/Cray-HPE/cray-istio/blob/master/CONTRIBUTING.md">CONTRIBUTING.md</a> file for how to contribute to this project.