# cray-istio Upgrade Notes

## Description

Cray-Istio is a customized distribution of the Istio service mesh tailored for HPE's Cray supercomputers and high-performance computing (HPC) workloads. It tunes Istio to minimize overhead and maximize performance for demanding HPC tasks, integrates with HPE's HPC ecosystem for coordinated management with schedulers and resource managers, and may include additional security features relevant to HPC environments. This chart runs after cray-istio-deploy, which creates the Istio CRDs (Gateways, VirtualServices, etc.) that are used by this chart.

Cray-Istio builds on Istio, an open-source service mesh that provides traffic management, security, and observability for microservices, making it valuable for managing complex HPC deployments.

## Pre-requisites

- As part of upgrading to a new version, make sure the latest images are added to Artifactory.
- Helm does not support upgrading CRDs during a chart upgrade, so they must be upgraded explicitly; this is handled as part of the chart templates.
- Make sure the latest CRDs are updated in the following charts (a quick way to verify the versions currently installed is shown after this list):
  - <a href="https://github.com/Cray-HPE/cray-istio/tree/master/kubernetes/cray-istio-operator">cray-istio-operator</a>
  - <a href="https://github.com/Cray-HPE/cray-istio/tree/master/kubernetes/cray-istio-deploy">cray-istio-deploy</a>
  - <a href="https://github.com/Cray-HPE/cray-istio/tree/master/kubernetes/cray-istio">cray-istio</a>
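
As a quick check (not part of the formal procedure), the Istio CRDs currently installed on the cluster and the API versions they serve can be listed with `kubectl`, assuming working cluster access:

```sh
# List installed Istio CRDs and the API versions they serve
kubectl get crds -o custom-columns=NAME:.metadata.name,VERSIONS:.spec.versions[*].name | grep istio.io
```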

## Upgrade

Once the changes are done, push a tag so that the Helm charts are added to the stable directory in Artifactory. These charts can then be deployed on a cluster with standard loftsman commands after applying the customizations over them, as described in the steps below.

## STEPS

### Create `sysmgmt.yaml` like the following

```yaml
apiVersion: manifests/v1beta1
# ...
spec:
  # ...
    timeout: 20m
```
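
The snippet above shows only the first and last lines of the manifest. For reference, here is a minimal hedged sketch of what a complete loftsman manifest for these charts might look like; the chart names, versions, and namespaces are placeholders and must be taken from the actual release being deployed:

```yaml
apiVersion: manifests/v1beta1
metadata:
  name: sysmgmt
spec:
  charts:
    # Placeholder entries: substitute the chart versions published to Artifactory
    - name: cray-istio-operator
      version: 0.0.0
      namespace: istio-system
    - name: cray-istio-deploy
      version: 0.0.0
      namespace: istio-system
    - name: cray-istio
      version: 0.0.0
      namespace: istio-system
      timeout: 20m
```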

### Apply the customizations with the following command
```sh
manifestgen -c customizations.yaml -i sysmgmt.yaml -o sysmgmt-cust.yaml
```
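
If `customizations.yaml` is not already present on the system, it can typically be extracted from the `site-init` secret in the `loftsman` namespace (this assumes a standard CSM installation; adjust the secret name or namespace if the system differs):

```sh
# Extract customizations.yaml from the site-init secret
kubectl -n loftsman get secret site-init -o jsonpath='{.data.customizations\.yaml}' | base64 -d > customizations.yaml
```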

### Upgrade the charts using loftsman

```sh
loftsman ship --charts-path <tgz-charts-path> --manifest-path sysmgmt-cust.yaml
```
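
After `loftsman ship` completes, one way to confirm the result (assuming the charts deploy into the `istio-system` namespace) is to check the installed Helm releases and pod health:

```sh
# Verify the releases and pods after the upgrade
helm ls -n istio-system
kubectl get pods -n istio-system
```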

---

## Troubleshooting: `istio-proxy` failing with too many open files

### Issue Description

After a CSM upgrade, some pods running the new `istio-proxy` image may fail to come up with a `too many open files` error because the node's `fs.inotify.max_user_instances` and `fs.inotify.max_user_watches` limits are too low.
The same failure can occur whenever pods with an `istio-proxy` sidecar restart (for example, after a power outage or node reboot) and cannot obtain enough `inotify` resources.

#### Related Issue

- [Istio Issue #35829](https://github.com/istio/istio/issues/35829)

### Error Identification

When the issue occurs, the following errors appear in the `istio-proxy` container logs.

```text
2024-07-22T17:00:37.322350Z info Workload SDS socket not found. Starting Istio SDS Server
2024-07-22T17:00:37.322393Z info CA Endpoint istiod.istio-system.svc:15012, provider Citadel
2024-07-22T17:00:37.322395Z info Opening status port 15020
2024-07-22T17:00:37.322436Z info Using CA istiod.istio-system.svc:15012 cert with certs: var/run/secrets/istio/root-cert.pem
2024-07-22T17:00:37.323487Z error failed to start SDS server: failed to start workload secret manager too many open files
Error: failed to start SDS server: failed to start workload secret manager too many open files
```
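
To pull these logs from an affected pod, inspect the `istio-proxy` container directly (the pod name and namespace below are placeholders):

```sh
# Inspect the istio-proxy sidecar logs of an affected pod
kubectl logs -n <namespace> <pod-name> -c istio-proxy
```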

### Error Conditions

This issue manifests when:

- Pods are unable to create enough `inotify` instances to monitor required files.
- The system hits the maximum number of file watches, causing crashes or failures in services dependent on file system event monitoring.

This problem can be triggered by events like:

- A node dying and rebooting mid-upgrade.
- Power outages where pods restart on nodes with old kernel settings.
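
Before applying the fix below, the limits currently in effect can be inspected on the affected nodes; the node list mirrors the one used in the fix and should be adjusted to match the actual system:

```sh
# Read the current inotify limits on the management and worker NCNs
pdsh -w ncn-m00[1-3],ncn-w00[1-5] 'sysctl fs.inotify.max_user_instances fs.inotify.max_user_watches'
```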

### Fix Description

Manually increase the `fs.inotify.max_user_instances` and `fs.inotify.max_user_watches` values to provide sufficient resources for Istio and other Kubernetes components.

```bash
pdsh -w ncn-m00[1-3],ncn-w00[1-5] 'sysctl -w fs.inotify.max_user_instances=1024'
pdsh -w ncn-m00[1-3],ncn-w00[1-5] 'sysctl -w user.max_inotify_instances=1024'
pdsh -w ncn-m00[1-3],ncn-w00[1-5] 'sysctl -w fs.inotify.max_user_watches=1048576'
pdsh -w ncn-m00[1-3],ncn-w00[1-5] 'sysctl -w user.max_inotify_watches=1048576'
```
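
Note that `sysctl -w` only changes the running kernel, so the values revert on reboot. A minimal sketch for persisting them, assuming the NCN images do not already manage these settings elsewhere (the drop-in file name is an arbitrary choice):

```bash
# Persist the inotify limits across reboots via a sysctl drop-in, then reload
pdsh -w ncn-m00[1-3],ncn-w00[1-5] 'printf "fs.inotify.max_user_instances=1024\nfs.inotify.max_user_watches=1048576\n" > /etc/sysctl.d/99-inotify.conf && sysctl --system'
```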

---

## Contributing

See the <a href="https://github.com/Cray-HPE/cray-istio/blob/master/CONTRIBUTING.md">CONTRIBUTING.md</a> file for how to contribute to this project.

## License

This project is copyrighted by Hewlett Packard Enterprise Development LP and is distributed under the MIT license. See the <a href="https://github.com/Cray-HPE/cray-istio/blob/master/LICENSE">LICENSE</a> file for details.
