From 9af61fa41a7d8f7a2e2b63b72e755504157f3737 Mon Sep 17 00:00:00 2001
From: "Mitch Harding (the weird one)"
Date: Fri, 31 Jan 2025 16:11:59 -0500
Subject: [PATCH] CASMINST-7165: Linting

---
 .../cfs-api_pods_in_CLBO_state.md             | 34 +++++++++----------
 ...e_for_Upgrade_to_Next_CSM_Major_Version.md | 21 +++++-------
 2 files changed, 25 insertions(+), 30 deletions(-)

diff --git a/troubleshooting/known_issues/cfs-api_pods_in_CLBO_state.md b/troubleshooting/known_issues/cfs-api_pods_in_CLBO_state.md
index ff8ced110649..f52f9da723c1 100644
--- a/troubleshooting/known_issues/cfs-api_pods_in_CLBO_state.md
+++ b/troubleshooting/known_issues/cfs-api_pods_in_CLBO_state.md
@@ -1,39 +1,37 @@
-# `CFS-API` pods in CLBO state during CSM install
+# `cfs-api` pods in CLBO state during CSM install
 
-## Issue Description
+## Issue description
 
-When installing CSM 1.6, `cray-shared-kafka-kafka-*` pods in the services namespace fail to come up which results in `CFS-API` pods in CLBO state. This happens because of an issue with Zookeeper related to slow DNS.
-Zookeeper fails to come up if the DNS is not set up for all hosts at startup. When this happens, the cluster gets stuck at Zookeeper pods running, but brokers not coming up.
+When installing CSM 1.6, `cray-shared-kafka-kafka` Kubernetes pods in the `services` namespace fail to come up, which results in
+`cfs-api` pods in the `CrashLoopBackOff` state. This happens because of an issue with Zookeeper related to slow DNS.
+Zookeeper fails to come up if the DNS is not set up for all hosts at startup. When this happens, the cluster gets stuck with
+the Zookeeper pods running, but brokers not coming up.
 
-### Related Issue
+This problem can be triggered by events such as slow DNS propagation to the Kubernetes DNS subsystem.
 
-- [Zookeeper Issue #4708](https://issues.apache.org/jira/browse/ZOOKEEPER-4708)
+For more information on the root cause, see [Zookeeper Issue #4708](https://issues.apache.org/jira/browse/ZOOKEEPER-4708).
 
-## Error Identification
+## Error identification
 
-When the issue occurs, the `cray-shared-kafka-kafka-*` pods in the services namespace fail to come up and will not be present.
-Also, `CFS-API` pods will be in CLBO state.
+When the issue occurs, the `cray-shared-kafka-kafka` pods in the `services` namespace fail to come up and will not be present,
+and the `cfs-api` pods will be in the CLBO state.
 
-The logs from `strimzi-cluster-operator-*` pod in the operators namespace will be throwing errors as follows:
+The logs from the `strimzi-cluster-operator-*` pod in the `operators` namespace will contain messages similar to the following:
 
 ```text
 2024-10-04T22:16:54.899932465Z 2024-10-04 22:16:54 ERROR StaticHostProvider:148 - Unable to resolve address: cray-shared-kafka-zookeeper-0.cray-shared-kafka-zookeeper-nodes.services.svc/:2181
 2024-10-04T22:16:54.899952739Z java.net.UnknownHostException: cray-shared-kafka-zookeeper-0.cray-shared-kafka-zookeeper-nodes.services.svc: Name or service not known
+```
+
+```text
 2024-10-04T22:21:54.061164856Z 2024-10-04 22:21:54 ERROR VertxUtil:127 - Reconciliation #1(watch) Kafka(services/cray-shared-kafka):Exceeded timeout of 300000ms while waiting for ZooKeeperAdmin connection to cray-shared-kafka-zookeeper-0.cray-shared-kafka-zookeeper-nodes.services.svc:2181,cray-shared-kafka-zookeeper-1.cray-shared-kafka-zookeeper-nodes.services.svc:2181,cray-shared-kafka-zookeeper-2.cray-shared-kafka-zookeeper-nodes.services.svc:2181 to be connected
 2024-10-04T22:21:54.061644246Z 2024-10-04 22:21:54 WARN ZookeeperScaler:157 - Reconciliation #1(watch) Kafka(services/cray-shared-kafka): Failed to connect to Zookeeper cray-shared-kafka-zookeeper-0.cray-shared-kafka-zookeeper-nodes.services.svc:2181,cray-shared-kafka-zookeeper-1.cray-shared-kafka-zookeeper-nodes.services.svc:2181,cray-shared-kafka-zookeeper-2.cray-shared-kafka-zookeeper-nodes.services.svc:2181. Connection was not ready in 300000 ms.
 2024-10-04T22:21:54.466771715Z 2024-10-04 22:21:54 WARN ZooKeeperReconciler:834 - Reconciliation #1(watch) Kafka(services/cray-shared-kafka): Failed to verify Zookeeper configuration
 ```
 
-## Error Conditions
-
-This problem can be triggered by events like:
-
-- Slow DNS propagation to Kubernetes DNS subsystem
-
-## Fix Description
+## Fix description
 
-The workaround is to delete the zookeeper pods and let them be re-created by the Strimzi operator.
+(`ncn-mw#`) The workaround is to delete the Zookeeper pods and let them be re-created by the Strimzi operator.
 
 ```bash
 kubectl delete pods -n services -l strimzi.io/controller-name=cray-shared-kafka-zookeeper
 ```
diff --git a/upgrade/Prepare_for_Upgrade_to_Next_CSM_Major_Version.md b/upgrade/Prepare_for_Upgrade_to_Next_CSM_Major_Version.md
index 4b43009e39e0..e52312d9a497 100644
--- a/upgrade/Prepare_for_Upgrade_to_Next_CSM_Major_Version.md
+++ b/upgrade/Prepare_for_Upgrade_to_Next_CSM_Major_Version.md
@@ -3,18 +3,6 @@
 Before beginning an upgrade from CSM 1.7 to CSM 1.8, there are a few things to do on the
 system first.
 
-- [Reduced resiliency during upgrade](#reduced-resiliency-during-upgrade)
-- [Preparation steps]
-
-  1. [Start typescript](#1-start-typescript)
-  1. [Ensure latest documentation installed](#2-ensure-latest-documentation-is-installed)
-  1. [Export Nexus data](#3-export-nexus-data)
-  1. [Adding switch admin password to Vault](#4-adding-switch-admin-password-to-vault)
-  1. [Ensure SNMP is configured on the management network switches](#5-ensure-snmp-is-configured-on-the-management-network-switches)
-  1. [Running sessions](#6-running-sessions)
-  1. [Health validation](#7-health-validation)
-  1. [Stop typescript](#8-stop-typescript)
-
 ## Reduced resiliency during upgrade
 
 **Warning:** Management service resiliency is reduced during the upgrade.
@@ -30,6 +18,15 @@ completes its upgrade, then quorum would be lost.
 
 ## Preparation steps
 
+1. [Start typescript](#1-start-typescript)
+1. [Ensure latest documentation installed](#2-ensure-latest-documentation-is-installed)
+1. [Export Nexus data](#3-export-nexus-data)
+1. [Adding switch admin password to Vault](#4-adding-switch-admin-password-to-vault)
+1. [Ensure SNMP is configured on the management network switches](#5-ensure-snmp-is-configured-on-the-management-network-switches)
+1. [Running sessions](#6-running-sessions)
+1. [Health validation](#7-health-validation)
+1. [Stop typescript](#8-stop-typescript)
+
 ### 1. Start typescript
 
 1. (`ncn-m001#`) If a typescript session is already running in the shell, then first stop it with
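After the workaround in the first patched document (deleting the Zookeeper pods so the Strimzi operator re-creates them), recovery can be confirmed by polling until the pods report `Running`. The sketch below is illustrative only and is not part of the patch: the `kubectl` shell function at the top is a mock standing in for the real CLI so the snippet is self-contained; on a live system, remove the mock and the loop polls the actual cluster. The label selector is the same one used by the delete command in the workaround.

```shell
#!/usr/bin/env bash
# Mock kubectl so this sketch runs without a cluster (assumption: on a real
# system this function is removed and the genuine kubectl binary is used).
kubectl() {
  echo "cray-shared-kafka-zookeeper-0   1/1   Running   0   1m"
}

# Poll until the Zookeeper pods re-created by the Strimzi operator are Running.
for attempt in 1 2 3 4 5; do
  out="$(kubectl get pods -n services \
    -l strimzi.io/controller-name=cray-shared-kafka-zookeeper --no-headers)"
  if printf '%s\n' "$out" | grep -q "Running"; then
    echo "Zookeeper pods are back up"
    break
  fi
  # Not running yet; wait before the next poll.
  sleep 10
done
```

Once the Zookeeper pods are `Running`, the Kafka broker and `cfs-api` pods should follow on their own; no further action is described in the source beyond the single delete command.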