Added additional entries for troubleshooting unhealthy cluster #119914

Merged on Jan 15, 2025 (15 commits).
@@ -74,35 +74,31 @@ A shard can become unassigned for several reasons. The following tips outline th
most common causes and their solutions.

[discrete]
[[fix-cluster-status-reenable-allocation]]
===== Re-enable shard allocation
[[fix-cluster-status-only-one-node]]
===== Single-node cluster

You typically disable allocation during a <<restart-cluster,restart>> or other
cluster maintenance. If you forgot to re-enable allocation afterward, {es} will
be unable to assign shards. To re-enable allocation, reset the
`cluster.routing.allocation.enable` cluster setting.
{es} never assigns a replica to the same node as its primary shard. On a single-node cluster, a yellow status is therefore expected. If you prefer a green status, set <<dynamic-index-number-of-replicas,`index.number_of_replicas`>> to `0` on each index.

[source,console]
----
PUT _cluster/settings
{
"persistent" : {
"cluster.routing.allocation.enable" : null
}
}
----

See https://www.youtube.com/watch?v=MiKKUdZvwnI[this video] for a walkthrough of troubleshooting "no allocations are allowed".
Similarly, if the number of replicas equals or exceeds the number of nodes, {es} cannot allocate one or more of the replica shards, for the same reason.
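
As a rough illustration, the requests below compare the number of nodes with each index's replica count, then reduce the replica count to zero for a single index. `my-index` is a placeholder index name, and the selected `h=` columns are only one possible choice.

[source,console]
----
GET _cat/nodes?v=true&h=name,node.role

GET _cat/indices?v=true&h=index,health,pri,rep

PUT my-index/_settings
{
  "index": {
    "number_of_replicas": 0
  }
}
----

With zero replicas an index has no redundant copy of its data, so this is generally only suitable for single-node development or test clusters.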

[discrete]
[[fix-cluster-status-recover-nodes]]
===== Recover lost nodes

Shards often become unassigned when a data node leaves the cluster. This can
occur for several reasons, ranging from connectivity issues to hardware failure.
occur for several reasons.

* If you manually restart a node, the cluster is temporarily unhealthy until the node recovers.

* If a node is overloaded or stops operating for any reason, the cluster is temporarily unhealthy. Nodes may disconnect because of prolonged garbage collection (GC) pauses, which can result from "out of memory" errors or from high memory usage during intensive search operations. See <<fix-cluster-status-jvm,Reduce JVM memory pressure>> for more JVM-related issues.

* If nodes cannot communicate reliably because of networking issues, they may lose contact with one another, which can cause shards to fall out of sync. You can often identify this issue by checking the logs for repeated messages about nodes leaving and rejoining the cluster.
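
To narrow down which of these causes applies, it can help to look at per-node resource usage. The following is a sketch using the cat nodes and nodes stats APIs; the selected columns and the `filter_path` value are illustrative choices, not a required set.

[source,console]
----
GET _cat/nodes?v=true&h=name,uptime,heap.percent,ram.percent,cpu,load_1m

GET _nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent,nodes.*.jvm.gc.collectors
----

A short `uptime` suggests a recent restart, while a persistently high `heap.percent` or rapidly growing GC collection times may point to memory pressure.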

After you resolve the issue and recover the node, it will rejoin the cluster.
{es} will then automatically allocate any unassigned shards.

You can monitor this process by <<cluster-health,checking your cluster health>>. The number of unassigned shards decreases progressively until the cluster reaches green status.
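
For example, a request along these lines returns only the status and the shard counters. The `filter_path` value is an illustrative selection of fields from the cluster health response.

[source,console]
----
GET _cluster/health?filter_path=status,unassigned_shards,delayed_unassigned_shards,initializing_shards
----

Repeating the request should show `unassigned_shards` falling toward zero as recovery proceeds.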

To avoid wasting resources on temporary issues, {es} <<delayed-allocation,delays
allocation>> by one minute by default. If you've recovered a node and don’t want
to wait for the delay period, you can call the <<cluster-reroute,cluster reroute
@@ -151,7 +147,7 @@ replica, it remains unassigned. To fix this, you can:

* Change the `index.number_of_replicas` index setting to reduce the number of
replicas for each primary shard. We recommend keeping at least one replica per
primary.
primary for high availability.

[source,console]
----
@@ -162,7 +158,6 @@ PUT _settings
----
// TEST[s/^/PUT my-index\n/]


[discrete]
[[fix-cluster-status-disk-space]]
===== Free up or increase disk space
@@ -183,6 +178,8 @@ If your nodes are running low on disk space, you have a few options:

* Upgrade your nodes to increase disk space.

* Add more nodes to the cluster.

* Delete unneeded indices to free up space. If you use {ilm-init}, you can
update your lifecycle policy to use <<ilm-searchable-snapshot,searchable
snapshots>> or add a delete phase. If you no longer need to search the data, you
@@ -215,11 +212,34 @@ watermark or set it to an explicit byte value.
PUT _cluster/settings
{
"persistent": {
"cluster.routing.allocation.disk.watermark.low": "30gb"
"cluster.routing.allocation.disk.watermark.low": "90%",
"cluster.routing.allocation.disk.watermark.high": "95%"
}
}
----
// TEST[s/"30gb"/null/]
**This is usually a temporary solution and may cause instability if disk space is not freed up.**
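
One possible follow-up, sketched under the assumption that the watermarks were raised with the request above: check per-node disk usage with the cat allocation API, and once space has been freed, reset both settings to their defaults by setting them to `null`.

[source,console]
----
GET _cat/allocation?v=true&h=node,shards,disk.percent,disk.avail,disk.total

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": null,
    "cluster.routing.allocation.disk.watermark.high": null
  }
}
----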

[discrete]
[[fix-cluster-status-reenable-allocation]]
===== Re-enable shard allocation

You typically disable allocation during a <<restart-cluster,restart>> or other
cluster maintenance. If you forgot to re-enable allocation afterward, {es} will
be unable to assign shards. To re-enable allocation, reset the
`cluster.routing.allocation.enable` cluster setting.

[source,console]
----
PUT _cluster/settings
{
"persistent" : {
"cluster.routing.allocation.enable" : null
}
}
----

See https://www.youtube.com/watch?v=MiKKUdZvwnI[this video] for a walkthrough of troubleshooting "no allocations are allowed".
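
If shards remain unassigned after the reset, the setting may also have been applied as a transient setting, which takes precedence over the persistent value. As a sketch, you can inspect the explicitly configured settings and, if needed, clear the transient value as well.

[source,console]
----
GET _cluster/settings?flat_settings=true

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": null
  }
}
----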

[discrete]
[[fix-cluster-status-jvm]]
Expand Down Expand Up @@ -267,4 +287,4 @@ POST _cluster/reroute?metric=none
// TEST[s/^/PUT my-index\n/]
// TEST[catch:bad_request]

See https://www.youtube.com/watch?v=6OAg9IyXFO4[this video] for a walkthrough of troubleshooting `no_valid_shard_copy`.