Retina Windows Crashing Due to $env Being Set to C:\hpc\config on Helm Chart Redeploy #1138
rayaisaiah changed the title from "Retina Windows Crashing Due to $env Not Saving on Helm Chart Redeploy" to "Retina Windows Crashing Due to $env Being Set to C:\hpc\config on Helm Chart Redeploy" on Dec 12, 2024
github-merge-queue bot pushed a commit that referenced this issue on Jan 3, 2025
…ues (#1128)

# Description

This PR aims to fix the stability of the retina windows agent. There were 4 causes identified and each commit resolves one respectively.

1. Invalid rendering of the namespace helm value (1st commit)

```
matmerr@matmerr-cloud-dev: ~/go/src/github.com/Azure/telescope [06:56:29 PM][matmerr-aks-pktmon-11][matmerr/enable-ama]$ k logs -f retina-agent-win-7f7kb
Starting Retina Agent
starting Retina daemon with legacy control plane v0.0.17
2024/12/02 18:56:22 metricsInterval is deprecated, please use metricsIntervalDuration instead
init client-go
KUBECONFIG set, using kubeconfig: C:\hpc\kubeconfig
Error: starting daemon: creating controller-runtime manager: error loading config file "C:\hpc\kubeconfig": yaml: invalid map key: map[interface {}]interface {}{".Values.namespace":interface {}(nil)}
```

2. Default operator value is enabled and will cause RBAC issues for the windows agents (2nd commit)

```
ts=2024-12-10T16:58:48.634Z level=info caller=hnsstats/hnsstats_windows.go:212 msg="Start hnsstats plugin..."
W1210 16:58:49.990792 7108 reflector.go:547] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:232: failed to list *v1alpha1.MetricsConfiguration: metricsconfigurations.retina.sh is forbidden: User "system:serviceaccount:kube-system:retina-agent" cannot list resource "metricsconfigurations" in API group "retina.sh" at the cluster scope
```

3. Telemetry enabled also causes the agent to panic if application insights is not defined. User can change the config map as desired but default values should not cause the agent to crash (3rd commit)

4. `kubeconfig` file cannot be found for the legacy chart values. Executing the `setkubeconfigpath.ps1` was required for the container setup (4th commit).

Update: It was later found that the missing `kubeconfig` error only exists on redeploy if the initial retina was before this change (#1118).
A later GH issue was created - #1138

```
beegii@bignamboi:~/src/retina$ k logs retina-agent-win-4tl7m -n kube-system
Starting Retina Agent
starting Retina daemon with legacy control plane v0.0.17
2024/12/11 18:40:15 metricsInterval is deprecated, please use metricsIntervalDuration instead
init client-go
KUBECONFIG set, using kubeconfig: C:\hpc\kubeconfig
Error: starting daemon: creating controller-runtime manager: CreateFile C:\hpc\kubeconfig: The system cannot find the file specified.
```

## Related Issue

#1122

## Checklist

- [x] I have read the [contributing documentation](https://retina.sh/docs/contributing).
- [x] I signed and signed-off the commits (`git commit -S -s ...`). See [this documentation](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification) on signing commits.
- [x] I have correctly attributed the author(s) of the code.
- [x] I have tested the changes locally.
- [x] I have followed the project's style guidelines.
- [x] I have updated the documentation, if necessary.
- [x] I have added tests, if applicable.

## Screenshots (if applicable) or Testing Completed

Each commit corresponding image was built and tested on the cluster to confirm each fix works!

![image](https://github.com/user-attachments/assets/dde7fe23-22ff-49bf-8c96-2c1a42c96f9d)

## Additional Notes

First three problems were experienced when deploying retina using the hubble path and the last issue was experienced when deploying retina using the legacy path

---

Please refer to the [CONTRIBUTING.md](../CONTRIBUTING.md) file for more information on how to contribute to this project.
Describe the bug
For the Retina Windows daemonset, the KUBECONFIG env variable is not set correctly by the powershell.exe command when the helm chart is first installed with an incorrect command, then uninstalled and reinstalled with a correct command.
This results in the retina-win pods going into CrashLoopBackOff with the error:
Error: starting daemon: creating controller-runtime manager: CreateFile C:\hpc\kubeconfig: The system cannot find the file specified.
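To tell whether a given retina-win pod is hitting this specific failure rather than a different crash, one option is to search its logs for the error string quoted above. The following is a minimal sketch, not part of Retina: `has_kubeconfig_error` is a hypothetical helper name, and the pod name in the usage comment is the one from the log excerpt in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of Retina): reads pod logs on stdin and
# exits 0 if they contain the missing-kubeconfig error quoted in this issue.
has_kubeconfig_error() {
  # -F matches a fixed string, so the backslashes in the Windows path are literal.
  # -q suppresses output; only the exit status is used.
  grep -qF 'CreateFile C:\hpc\kubeconfig: The system cannot find the file specified.'
}

# Example usage against a live cluster (assumes kubectl access):
#   kubectl logs retina-agent-win-4tl7m -n kube-system | has_kubeconfig_error \
#     && echo "pod is affected by the missing kubeconfig error"
```

Because the function only reads stdin, it can also be run against saved log files.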
To Reproduce
Steps to reproduce the behavior:
1. make helm-install-without-tls to install retina with hubble
2. make helm-uninstall to remove retina pods
3. make helm-install-without-tls to install retina with hubble
4. Error: starting daemon: creating controller-runtime manager: CreateFile C:\hpc\kubeconfig: The system cannot find the file specified.
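The sequence above can be sketched as a small script. This is an illustration under assumptions: it expects to be run from the retina repository root against a cluster with Windows nodes, and `run`, `repro`, and `DRY_RUN` are hypothetical names introduced here. By default it only prints the commands. Note that, per the description in this issue, the crash was seen when the first install used the incorrect PowerShell command, so running these targets on a chart that already has the fix may not reproduce it.

```shell
#!/bin/sh
# Sketch of the reproduction steps from this issue.
# DRY_RUN defaults to 1, so commands are printed rather than executed.
# Set DRY_RUN=0 to actually run them.
DRY_RUN="${DRY_RUN:-1}"

# run: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# repro: install, uninstall, then reinstall using the make targets named in the issue.
repro() {
  run make helm-install-without-tls   # install retina with hubble
  run make helm-uninstall             # remove retina pods
  run make helm-install-without-tls   # reinstall; retina-win pods then crash
}
```

Calling `repro` after sourcing the script prints the three commands; with `DRY_RUN=0` it executes them in order.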
Expected behavior
The $env variable for KUBECONFIG is incorrectly set to C:\hpc\config. The retina-win pods go into CrashLoopBackOff with the error:
Error: starting daemon: creating controller-runtime manager: CreateFile C:\hpc\kubeconfig: The system cannot find the file specified.
Screenshots
Platform (please complete the following information):
Additional context
Discovered after testing #1118 and setting the incorrect PowerShell command for the Retina Windows helm chart.
Mitigation
Configure the Retina Windows daemonset with the latest helm chart on main and create new Windows nodes. The retina-win pods that come up on those nodes will have the working PowerShell commands and set KUBECONFIG correctly. Alternatively, create a fresh cluster and helm install the Windows daemonset with the correct commands.