diff --git a/README.md b/README.md
index b88ae64..9a1566e 100644
--- a/README.md
+++ b/README.md
@@ -1,24 +1,27 @@
 # cray-istio Upgrade Notes
-
+
 ## Description
+
 Cray-Istio is a customized version of the Istio service mesh tailored for HPE's Cray supercomputers and high-performance computing (HPC) workloads. It optimizes Istio's performance to minimize overhead and maximize speed for demanding HPC tasks. Cray-Istio integrates seamlessly with HPE's HPC ecosystem, allowing coordinated management with schedulers and resource managers. It might also include additional security features relevant to HPC environments. This runs after cray-istio-deploy which creates the Istio CRDs that are used by this chart (Gateways, VirtualServices, etc.).
 Understanding Cray-Istio builds upon the foundation of Istio, an open-source service mesh. Istio provides features like traffic management, security, and observability for microservices, making it valuable for managing complex HPC deployments.
-
+
 ## Pre-requisites
+
 - As part of upgrading to a new version, make sure the latest version images are added to artifactory.
 - Helm does not support upgrading CRDs during chart upgrade. They need to be upgraded explicitly which is handled as part of the templates.
 - Make sure to update the latest CRDs in following:
-    - cray-istio-operator
-    - cray-istio-deploy
-    - cray-istio
+  - cray-istio-operator
+  - cray-istio-deploy
+  - cray-istio

 ## Upgrade
+
 Once the changes are done, a tag can be pushed so that the helm charts are added to artifactory in stable directory. These charts can be deployed on a cluster using standard loftsman commands after applying the customizations over them.
-## STEPS:
+## STEPS

-### Create sysmgmt.yaml like the following:
+### Create sysmgmt.yaml like the following

 ```yaml
 apiVersion: manifests/v1beta1
@@ -43,20 +46,73 @@ spec:
       timeout: 20m
 ```

-### Apply the following command to applying the customizations over them:
+### Run the following command to apply the customizations

 ```sh
 manifestgen -c customizations.yaml -i sysmgmt.yaml -o sysmgmt-cust.yaml
 ```

-### Upgrade the charts using loftsman:
+### Upgrade the charts using loftsman

-```sh 
+```sh
 loftsman ship --charts-path --manifest-path sysmgmt-cust.yaml
 ```

+---
+
+## Troubleshooting: Istio-Proxy failing with too many open files
+
+### Issue Description
+
+After the CSM upgrade, some nodes running Istio might not come up with the new `istio-proxy` image because of a "too many open files" error; those nodes need increased `fs.inotify.max_user_instances` and `fs.inotify.max_user_watches` values.
+When pods with `istio-proxy` restart (such as after a power outage or node reboot), they may fail because the system's `inotify` limits are too low.
+
+#### Related Issue
+
+- [Istio Issue #35829](https://github.com/istio/istio/issues/35829)
+
+### Error Identification
+
+When the issue occurs, the following errors appear in the `istio-proxy` logs:
+
+```text
+2024-07-22T17:00:37.322350Z	info	Workload SDS socket not found. Starting Istio SDS Server
+2024-07-22T17:00:37.322393Z	info	CA Endpoint istiod.istio-system.svc:15012, provider Citadel
+2024-07-22T17:00:37.322395Z	info	Opening status port 15020
+2024-07-22T17:00:37.322436Z	info	Using CA istiod.istio-system.svc:15012 cert with certs: var/run/secrets/istio/root-cert.pem
+2024-07-22T17:00:37.323487Z	error	failed to start SDS server: failed to start workload secret manager too many open files
+Error: failed to start SDS server: failed to start workload secret manager too many open files
+```
+
+### Error Conditions
+
+This issue manifests when:
+
+- Pods are unable to create enough `inotify` instances to monitor required files.
+- The system hits the maximum number of file watches, causing crashes or failures in services that depend on file system event monitoring.
+
+This problem can be triggered by events like:
+
+- A node dying and rebooting mid-upgrade.
+- Power outages after which pods restart on nodes with old kernel settings.
+
+### Fix Description
+
+Manually increase the `fs.inotify.max_user_instances` and `fs.inotify.max_user_watches` values to provide sufficient resources for Istio and other Kubernetes components.
+
+```bash
+pdsh -w ncn-m00[1-3],ncn-w00[1-5] 'sysctl -w fs.inotify.max_user_instances=1024'
+pdsh -w ncn-m00[1-3],ncn-w00[1-5] 'sysctl -w user.max_inotify_instances=1024'
+pdsh -w ncn-m00[1-3],ncn-w00[1-5] 'sysctl -w fs.inotify.max_user_watches=1048576'
+pdsh -w ncn-m00[1-3],ncn-w00[1-5] 'sysctl -w user.max_inotify_watches=1048576'
+```
+
+---
+
 ## Contributing
+
 See the CONTRIBUTING.md file for how to contribute to this project.
-
+
 ## License
-This project is copyrighted by Hewlett Packard Enterprise Development LP and is distributed under the MIT license. See the LICENSE file for details.
\ No newline at end of file
+
+This project is copyrighted by Hewlett Packard Enterprise Development LP and is distributed under the MIT license. See the LICENSE file for details.
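A note on the troubleshooting fix in the diff above: `sysctl -w` only changes the running kernel, so the increased inotify limits are lost on reboot. Below is a minimal sketch of persisting the same limits with a `sysctl.d` drop-in; the file name is illustrative, and the file is written to the current directory here, whereas on a real node it would go to `/etc/sysctl.d/` (distributed to the NCNs by whatever mechanism the site uses) and be loaded with `sysctl --system`.

```shell
# Sketch: persist the inotify limits from the fix above across reboots.
# NOTE: written to the current directory for illustration only; on a real
# node this file belongs in /etc/sysctl.d/ and is applied with
# `sysctl --system` (or automatically at boot).
cat > 99-inotify-limits.conf <<'EOF'
fs.inotify.max_user_instances=1024
fs.inotify.max_user_watches=1048576
EOF

# Show the settings that would be applied at boot.
cat 99-inotify-limits.conf
```

The effective values can be checked at any time with `sysctl -n fs.inotify.max_user_instances` and `sysctl -n fs.inotify.max_user_watches`.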