Commit 414a211

Add Troubleshooting Steps

* Istio Pods Crashing post upgrade due to `fs.inotify` Limits

arka-pramanik-hpe committed Oct 4, 2024
1 parent d643e2e, commit 414a211
Showing 1 changed file with 42 additions and 0 deletions: README.md

@@ -54,6 +54,48 @@ manifestgen -c customizations.yaml -i sysmgmt.yaml -o sysmgmt-cust.yaml
```sh
loftsman ship --charts-path <tgz-charts-path> --manifest-path sysmgmt-cust.yaml
```

---

## Troubleshooting: Istio Pods Crashing After Upgrade Due to `fs.inotify` Limits

### Issue Description

After an Istio upgrade, nodes may not yet have booted into the new image that raises the `fs.inotify.max_user_instances` and `fs.inotify.max_user_watches` kernel parameters. When pods restart (for example after a power outage or node reboot), they can fail because the inotify limits still in effect on those nodes are too low.

This issue manifests when:
- Pods are unable to create enough inotify instances to monitor required files.
- The system hits the maximum number of file watches, causing crashes or failures in services dependent on file system event monitoring.

This problem can be triggered by events like:
- A node dying and rebooting mid-upgrade.
- Power outages where pods restart on nodes with old kernel settings.
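
To confirm whether a node is still running with the old limits, the current values can be read with `sysctl` (a minimal check, assuming the same `pdsh` node targets used in the workaround below):

```bash
# Read the current inotify limits on the management and worker nodes.
pdsh -w ncn-m00[1-3],ncn-w00[1-5] 'sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches'
```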

### Related Issue

- [Istio Issue #35829](https://github.com/istio/istio/issues/35829)

The problem was worked around by increasing `fs.inotify.max_user_instances` from the default of 128 to 1024, as suggested in the GitHub thread.

### Steps to Reproduce

1. Upgrade Istio without rebooting the nodes into an image that has the updated kernel parameter values.
2. Trigger a pod restart or node reboot.
3. Observe pods failing to start due to inotify-related errors (see the check below).
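
A quick way to spot the failing pods, assuming `kubectl` access and that the affected workloads are in the `istio-system` namespace (a hedged sketch; the namespace and pod names may differ on a given system):

```bash
# List Istio pods that are not in the Running state after the restart.
kubectl get pods -n istio-system | grep -v Running

# Inspect events and container status for a failing pod (<pod-name> is a placeholder).
kubectl describe pod <pod-name> -n istio-system
```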

### Workaround

Manually increase the `fs.inotify.max_user_instances` and `fs.inotify.max_user_watches` values to provide sufficient resources for Istio and other Kubernetes components:

```bash
# Raise the maximum number of inotify instances per user (fs.inotify and user-namespace sysctls).
pdsh -w ncn-m00[1-3],ncn-w00[1-5] 'sysctl -w fs.inotify.max_user_instances=1024'
pdsh -w ncn-m00[1-3],ncn-w00[1-5] 'sysctl -w user.max_inotify_instances=1024'
# Raise the maximum number of inotify watches per user.
pdsh -w ncn-m00[1-3],ncn-w00[1-5] 'sysctl -w fs.inotify.max_user_watches=1048576'
pdsh -w ncn-m00[1-3],ncn-w00[1-5] 'sysctl -w user.max_inotify_watches=1048576'
```

This raises the limits from their default values, allowing more inotify instances and file watches. Note that `sysctl -w` only changes the running kernel: the values revert when a node reboots, unless the node boots into an image that already sets the higher limits.
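
If a persistent setting is needed before the nodes can be rebooted into an updated image, one option (not part of the original commit; a sketch assuming the standard `sysctl.d` drop-in mechanism is available on the NCNs) is to write the values to a drop-in file and reload:

```bash
# Persist the increased inotify limits across reboots via a sysctl drop-in file (hypothetical filename).
pdsh -w ncn-m00[1-3],ncn-w00[1-5] 'printf "fs.inotify.max_user_instances=1024\nfs.inotify.max_user_watches=1048576\n" > /etc/sysctl.d/99-inotify.conf'

# Reload sysctl settings from all configuration files.
pdsh -w ncn-m00[1-3],ncn-w00[1-5] 'sysctl --system'
```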

---

## Contributing
See the <a href="https://github.com/Cray-HPE/cray-istio/blob/master/CONTRIBUTING.md">CONTRIBUTING.md</a> file for how to contribute to this project.