Add Troubleshooting Steps #46

Open
wants to merge 1 commit into
base: master

arka-pramanik-hpe
Contributor

Summary and Scope

Update the cray-istio documentation with troubleshooting steps for Istio pods that crash after an upgrade because of fs.inotify limits.
After the Istio upgrade, the nodes have not yet been rebooted into a new image in which the limits (fs.inotify.max_user_instances and fs.inotify.max_user_watches) have been increased. As a result, when the pods restart, they may try to watch more files or create more inotify instances than the system allows. The pods can then crash or fail because they cannot watch the files or directories they need, which may be critical for service mesh operations such as traffic management, logging, or configuration changes.
In addition, a power outage or node reboot in the middle of an upgrade would trigger the same problem, because the pods would restart before the required kernel parameters had been updated on the nodes.
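
To see which processes on a node hold inotify instances, one option is to scan the file descriptors under /proc. The following is a minimal sketch, not part of this PR: the function name `count_inotify_instances` is hypothetical, and it relies on Linux reporting inotify descriptors as links to `anon_inode:inotify`. Note that fs.inotify.max_user_instances is enforced per user, so the per-process counts are only a diagnostic aid.

```python
import os
from collections import Counter
from pathlib import Path


def count_inotify_instances(proc_root="/proc"):
    """Return a Counter mapping PID -> number of open inotify instances.

    Hypothetical helper for illustration; run as root to see all processes.
    """
    counts = Counter()
    root = Path(proc_root)
    if not root.is_dir():
        return counts  # not a Linux host, or /proc is not mounted
    for entry in root.iterdir():
        if not entry.name.isdigit():
            continue  # only numeric directories correspond to processes
        try:
            fds = list((entry / "fd").iterdir())
        except OSError:
            continue  # process exited, or permission denied
        for fd in fds:
            try:
                target = os.readlink(fd)
            except OSError:
                continue  # descriptor closed while scanning
            if target == "anon_inode:inotify":
                counts[int(entry.name)] += 1
    return counts


if __name__ == "__main__":
    # Show the ten processes holding the most inotify instances.
    for pid, n in count_inotify_instances().most_common(10):
        print(f"pid={pid} inotify_instances={n}")
```

Comparing the totals for a given user against the node's current fs.inotify.max_user_instances value can help confirm whether the limit is the likely cause of the crashes.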

Issues and Related PRs

List and characterize relationship to Jira/Github issues and other pull requests. Be sure to list dependencies.

Testing

List the environments in which these changes were tested.

Tested on:

  • surtur
  • starlord
  • Local development environment
  • Virtual Shasta

Test description:

Increased the fs.inotify.max_user_instances and fs.inotify.max_user_watches values to provide sufficient resources for Istio.
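
A check along these lines could confirm that a node has picked up the increased values. This is a sketch under stated assumptions: the minimums in `EXAMPLE_MINIMUMS` are placeholders chosen for illustration, not the values used in this PR, and the helper names are hypothetical. The /proc/sys/fs/inotify path is the standard Linux location for these settings.

```python
from pathlib import Path

# Placeholder minimums for illustration only (assumption); substitute the
# values required by the documentation for your release.
EXAMPLE_MINIMUMS = {
    "max_user_instances": 1024,
    "max_user_watches": 1048576,
}


def read_inotify_limits(sysctl_dir="/proc/sys/fs/inotify"):
    """Read the current inotify limits from the given sysctl directory."""
    limits = {}
    for name in EXAMPLE_MINIMUMS:
        limits[name] = int((Path(sysctl_dir) / name).read_text().strip())
    return limits


def find_low_limits(limits, minimums=EXAMPLE_MINIMUMS):
    """Return {name: (current, minimum)} for each limit below its minimum."""
    return {
        name: (limits[name], minimum)
        for name, minimum in minimums.items()
        if limits[name] < minimum
    }


if __name__ == "__main__":
    try:
        low = find_low_limits(read_inotify_limits())
    except OSError as err:
        print(f"could not read inotify limits: {err}")
    else:
        for name, (current, minimum) in low.items():
            print(f"fs.inotify.{name} = {current}, below example minimum {minimum}")
        if not low:
            print("inotify limits meet the example minimums")
```

Taking the directory as a parameter keeps the check testable against a temporary directory without touching the real node settings.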

  • Were the install/upgrade-based validation checks/tests run (goss tests/install-validation doc)?
  • Were continuous integration tests run? If not, why?
  • Was upgrade tested? If not, why? Y
  • Was downgrade tested? If not, why? Y
  • Were new tests (or test issues/Jiras) created for this change?

Risks and Mitigations

Low.

Pull Request Checklist

  • Version number(s) incremented, if applicable
  • Copyrights updated
  • License file intact
  • Target branch correct
  • CHANGELOG.md updated
  • Testing is appropriate and complete, if applicable
  • HPC Product Announcement prepared, if applicable

@spillerc-hpe

Is there a related PR for docs-csm? No customer will ever find this documentation.

@arka-pramanik-hpe
Contributor Author

> Is there a related PR for docs-csm? No customer will ever find this documentation.

Raised a PR for docs-csm: Cray-HPE/docs-csm#5430
Added the documentation under Troubleshooting/Known_issues, per @leliasen-hpe:

* Istio Pods Crashing post upgrade due to `fs.inotify` Limits
2 participants