CASMINST-7165: Linting
mharding-hpe committed Jan 31, 2025
1 parent 79082e5 commit 9af61fa
Showing 2 changed files with 25 additions and 30 deletions.
34 changes: 16 additions & 18 deletions troubleshooting/known_issues/cfs-api_pods_in_CLBO_state.md
@@ -1,39 +1,37 @@
# `CFS-API` pods in CLBO state during CSM install
# `cfs-api` pods in CLBO state during CSM install

## Issue Description
## Issue description

When installing CSM 1.6, `cray-shared-kafka-kafka-*` pods in the services namespace fail to come up which results in `CFS-API` pods in CLBO state. This happens because of an issue with Zookeeper related to slow DNS.
Zookeeper fails to come up if the DNS is not set up for all hosts at startup. When this happens, the cluster gets stuck at Zookeeper pods running, but brokers not coming up.
When installing CSM 1.6, `cray-shared-kafka-kafka` Kubernetes pods in the `services` namespace fail to come up, which results in
`cfs-api` pods in the `CrashLoopBackOff` state. This happens because of an issue with Zookeeper related to slow DNS.
Zookeeper fails to come up if the DNS is not set up for all hosts at startup. When this happens, the cluster gets stuck with
the Zookeeper pods running, but brokers not coming up.

### Related Issue
This problem can be triggered by events such as slow DNS propagation to the Kubernetes DNS subsystem.

- [Zookeeper Issue #4708](https://issues.apache.org/jira/browse/ZOOKEEPER-4708)
For more information on the root cause, see [Zookeeper Issue #4708](https://issues.apache.org/jira/browse/ZOOKEEPER-4708).

## Error Identification
## Error identification

When the issue occurs, the `cray-shared-kafka-kafka-*` pods in the services namespace fail to come up and will not be present.
Also, `CFS-API` pods will be in CLBO state.
When the issue occurs, the `cray-shared-kafka-kafka` pods in the `services` namespace fail to come up and will not be present,
and the `cfs-api` pods will be in the CLBO state.

The logs from `strimzi-cluster-operator-*` pod in the operators namespace will be throwing errors as follows:
The logs from the `strimzi-cluster-operator-*` pod in the `operators` namespace will contain messages similar to the following:

```text
2024-10-04T22:16:54.899932465Z 2024-10-04 22:16:54 ERROR StaticHostProvider:148 - Unable to resolve address: cray-shared-kafka-zookeeper-0.cray-shared-kafka-zookeeper-nodes.services.svc/<unresolved>:2181
2024-10-04T22:16:54.899952739Z java.net.UnknownHostException: cray-shared-kafka-zookeeper-0.cray-shared-kafka-zookeeper-nodes.services.svc: Name or service not known
```

```text
2024-10-04T22:21:54.061164856Z 2024-10-04 22:21:54 ERROR VertxUtil:127 - Reconciliation #1(watch) Kafka(services/cray-shared-kafka):Exceeded timeout of 300000ms while waiting for ZooKeeperAdmin connection to cray-shared-kafka-zookeeper-0.cray-shared-kafka-zookeeper-nodes.services.svc:2181,cray-shared-kafka-zookeeper-1.cray-shared-kafka-zookeeper-nodes.services.svc:2181,cray-shared-kafka-zookeeper-2.cray-shared-kafka-zookeeper-nodes.services.svc:2181 to be connected
2024-10-04T22:21:54.061644246Z 2024-10-04 22:21:54 WARN ZookeeperScaler:157 - Reconciliation #1(watch) Kafka(services/cray-shared-kafka): Failed to connect to Zookeeper cray-shared-kafka-zookeeper-0.cray-shared-kafka-zookeeper-nodes.services.svc:2181,cray-shared-kafka-zookeeper-1.cray-shared-kafka-zookeeper-nodes.services.svc:2181,cray-shared-kafka-zookeeper-2.cray-shared-kafka-zookeeper-nodes.services.svc:2181. Connection was not ready in 300000 ms.
2024-10-04T22:21:54.466771715Z 2024-10-04 22:21:54 WARN ZooKeeperReconciler:834 - Reconciliation #1(watch) Kafka(services/cray-shared-kafka): Failed to verify Zookeeper configuration
```
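
When operator logs have already been captured, a small shell filter can check them for the messages shown above. This is a minimal sketch and not part of the CSM documentation: the function name `has_zookeeper_dns_error` is hypothetical, and the patterns cover only the two message forms quoted in this section.

```bash
# Hypothetical helper (not part of CSM): reads log text on standard input and
# returns success if either Zookeeper DNS failure message shown above appears.
has_zookeeper_dns_error() {
    grep -Eq 'UnknownHostException: cray-shared-kafka-zookeeper|Failed to connect to Zookeeper cray-shared-kafka-zookeeper'
}
```

With the logs saved to a file (the name `operator.log` is an assumption), `has_zookeeper_dns_error < operator.log` returns a zero exit status when the issue is likely present.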

## Error Conditions

This problem can be triggered by events like:

- Slow DNS propagation to Kubernetes DNS subsystem

## Fix Description
## Fix description

The workaround is to delete the zookeeper pods and let them be re-created by the Strimzi operator.
(`ncn-mw#`) The workaround is to delete the Zookeeper pods and let them be re-created by the Strimzi operator.

```bash
kubectl delete pods -n services -l strimzi.io/controller-name=cray-shared-kafka-zookeeper
```
21 changes: 9 additions & 12 deletions upgrade/Prepare_for_Upgrade_to_Next_CSM_Major_Version.md
@@ -3,18 +3,6 @@

Before beginning an upgrade from CSM 1.7 to CSM 1.8, there are a few things to do on the system first.

- [Reduced resiliency during upgrade](#reduced-resiliency-during-upgrade)
- [Preparation steps]

1. [Start typescript](#1-start-typescript)
1. [Ensure latest documentation installed](#2-ensure-latest-documentation-is-installed)
1. [Export Nexus data](#3-export-nexus-data)
1. [Adding switch admin password to Vault](#4-adding-switch-admin-password-to-vault)
1. [Ensure SNMP is configured on the management network switches](#5-ensure-snmp-is-configured-on-the-management-network-switches)
1. [Running sessions](#6-running-sessions)
1. [Health validation](#7-health-validation)
1. [Stop typescript](#8-stop-typescript)

## Reduced resiliency during upgrade

**Warning:** Management service resiliency is reduced during the upgrade.
@@ -30,6 +18,15 @@ completes its upgrade, then quorum would be lost.

## Preparation steps

1. [Start typescript](#1-start-typescript)
1. [Ensure latest documentation installed](#2-ensure-latest-documentation-is-installed)
1. [Export Nexus data](#3-export-nexus-data)
1. [Adding switch admin password to Vault](#4-adding-switch-admin-password-to-vault)
1. [Ensure SNMP is configured on the management network switches](#5-ensure-snmp-is-configured-on-the-management-network-switches)
1. [Running sessions](#6-running-sessions)
1. [Health validation](#7-health-validation)
1. [Stop typescript](#8-stop-typescript)

### 1. Start typescript

1. (`ncn-m001#`) If a typescript session is already running in the shell, then first stop it with