From 125197f3e6739762f615857eca7066b924ec972b Mon Sep 17 00:00:00 2001
From: Kofi B
Date: Wed, 15 Jan 2025 11:57:40 -0800
Subject: [PATCH] Added additional entries for troubleshooting unhealthy cluster (#119914)

* Added additional entries for troubleshooting unhealthy cluster

Reordered "Re-enable shard allocation" because not as common as other causes
Added additional causes of yellow statuses
Changed watermark command to include high and low watermark so users can make their cluster operate once again.

* Drive-by copyedit with suggestions for concision and some formatting fixes.

Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com>

* Concision and some formatting fixes.

Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com>

* Colon added

Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com>

* Update docs/reference/troubleshooting/common-issues/red-yellow-cluster-status.asciidoc

Co-authored-by: shainaraskas <58563081+shainaraskas@users.noreply.github.com>

* Title change

Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com>

* Update docs/reference/troubleshooting/common-issues/red-yellow-cluster-status.asciidoc

Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com>

* Spelling fix

* Update docs/reference/troubleshooting/common-issues/red-yellow-cluster-status.asciidoc

Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com>

* Update docs/reference/troubleshooting/common-issues/red-yellow-cluster-status.asciidoc

Co-authored-by: George Wallace

* Update docs/reference/troubleshooting/common-issues/red-yellow-cluster-status.asciidoc

Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com>

* Update docs/reference/troubleshooting/common-issues/red-yellow-cluster-status.asciidoc

Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com>

---------

Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com>
Co-authored-by: shainaraskas <58563081+shainaraskas@users.noreply.github.com>
Co-authored-by: George Wallace
---
 .../red-yellow-cluster-status.asciidoc | 71 +++++++++++++------
 1 file changed, 48 insertions(+), 23 deletions(-)

diff --git a/docs/reference/troubleshooting/common-issues/red-yellow-cluster-status.asciidoc b/docs/reference/troubleshooting/common-issues/red-yellow-cluster-status.asciidoc
index c07e92c058991..5d74ca66ee6b3 100644
--- a/docs/reference/troubleshooting/common-issues/red-yellow-cluster-status.asciidoc
+++ b/docs/reference/troubleshooting/common-issues/red-yellow-cluster-status.asciidoc
@@ -78,35 +78,31 @@ A shard can become unassigned for several reasons. The following tips outline the
 most common causes and their solutions.

 [discrete]
-[[fix-cluster-status-reenable-allocation]]
-===== Re-enable shard allocation
+[[fix-cluster-status-only-one-node]]
+===== Single node cluster

-You typically disable allocation during a <> or other
-cluster maintenance. If you forgot to re-enable allocation afterward, {es} will
-be unable to assign shards. To re-enable allocation, reset the
-`cluster.routing.allocation.enable` cluster setting.
+{es} will never assign a replica to the same node as the primary shard. A single-node cluster will always have yellow status. To change to green, set <> to 0 for all indices.

-[source,console]
-----
-PUT _cluster/settings
-{
-  "persistent" : {
-    "cluster.routing.allocation.enable" : null
-  }
-}
-----
-
-See https://www.youtube.com/watch?v=MiKKUdZvwnI[this video] for walkthrough of troubleshooting "no allocations are allowed".
+Therefore, if the number of replicas equals or exceeds the number of nodes, some shards won't be allocated.

 [discrete]
 [[fix-cluster-status-recover-nodes]]
 ===== Recover lost nodes

 Shards often become unassigned when a data node leaves the cluster. This can
-occur for several reasons, ranging from connectivity issues to hardware failure.
+occur for several reasons:
+
+* A manual node restart will cause a temporary unhealthy cluster state until the node recovers.
+
+* When a node becomes overloaded or fails, it can temporarily disrupt the cluster’s health, leading to an unhealthy state. Prolonged garbage collection (GC) pauses, caused by out-of-memory errors or high memory usage during intensive searches, can trigger this state. See <> for more JVM-related issues.
+
+* Network issues can prevent reliable node communication, causing shards to become out of sync. Check the logs for repeated messages about nodes leaving and rejoining the cluster.
+
 After you resolve the issue and recover the node, it will rejoin the cluster.
 {es} will then automatically allocate any unassigned shards.

+You can monitor this process by <>. The number of unallocated shards should progressively decrease until green status is reached.
+
 To avoid wasting resources on temporary issues, {es} <> by one minute by default. If you've recovered a node and don’t want
 to wait for the delay period, you can call the <> or add a delete phase. If you no longer need to search the data, you
@@ -219,11 +216,39 @@ watermark or set it to an explicit byte value.
 PUT _cluster/settings
 {
   "persistent": {
-    "cluster.routing.allocation.disk.watermark.low": "30gb"
+    "cluster.routing.allocation.disk.watermark.low": "90%",
+    "cluster.routing.allocation.disk.watermark.high": "95%"
   }
 }
 ----
-// TEST[s/"30gb"/null/]
+// TEST[s/"90%"/null/]
+// TEST[s/"95%"/null/]
+
+[IMPORTANT]
+====
+This is usually a temporary solution and may cause instability if disk space is not freed up.
+====
+
+[discrete]
+[[fix-cluster-status-reenable-allocation]]
+===== Re-enable shard allocation
+
+You typically disable allocation during a <> or other
+cluster maintenance. If you forgot to re-enable allocation afterward, {es} will
+be unable to assign shards. To re-enable allocation, reset the
+`cluster.routing.allocation.enable` cluster setting.
+
+[source,console]
+----
+PUT _cluster/settings
+{
+  "persistent" : {
+    "cluster.routing.allocation.enable" : null
+  }
+}
+----
+
+See https://www.youtube.com/watch?v=MiKKUdZvwnI[this video] for walkthrough of troubleshooting "no allocations are allowed".

 [discrete]
 [[fix-cluster-status-jvm]]
@@ -271,4 +296,4 @@ POST _cluster/reroute
 // TEST[s/^/PUT my-index\n/]
 // TEST[catch:bad_request]

-See https://www.youtube.com/watch?v=6OAg9IyXFO4[this video] for a walkthrough of troubleshooting `no_valid_shard_copy`.
\ No newline at end of file
+See https://www.youtube.com/watch?v=6OAg9IyXFO4[this video] for a walkthrough of troubleshooting `no_valid_shard_copy`.
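
Reviewer note on the new "Single node cluster" entry: the rule it states (a replica is never placed on the same node as another copy of the same shard) can be checked with a little arithmetic. The sketch below is illustrative only and is not part of the patch. The function names `unassigned_replica_copies` and `expected_status` are hypothetical, and the model assumes the same-node rule is the only constraint, ignoring disk watermarks, allocation filters, and disabled allocation.

```python
def unassigned_replica_copies(data_nodes: int, number_of_replicas: int,
                              primaries: int = 1) -> int:
    """Count replica copies that cannot be placed (hypothetical helper).

    Assumption: each shard has 1 primary + `number_of_replicas` replicas,
    and no two copies of the same shard may share a node. Other allocation
    constraints are deliberately ignored.
    """
    if data_nodes < 1:
        raise ValueError("a cluster needs at least one data node")
    if number_of_replicas < 0 or primaries < 0:
        raise ValueError("shard counts must be non-negative")
    copies_per_shard = 1 + number_of_replicas
    # At most one copy per node, so anything beyond `data_nodes` stays unassigned.
    unplaceable_per_shard = max(0, copies_per_shard - data_nodes)
    return unplaceable_per_shard * primaries


def expected_status(data_nodes: int, number_of_replicas: int) -> str:
    """Return "yellow" if some replica must stay unassigned, else "green"."""
    if unassigned_replica_copies(data_nodes, number_of_replicas) > 0:
        return "yellow"
    return "green"


if __name__ == "__main__":
    # Single node with the default of one replica, then with replicas set to 0.
    print(expected_status(data_nodes=1, number_of_replicas=1))
    print(expected_status(data_nodes=1, number_of_replicas=0))
```

Under this model, a replica count equal to or greater than the node count always leaves at least one copy per shard unassigned, which agrees with the sentence added in the patch.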