Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fix][GCS] Implement reconnection for RedisContext #48781

Open
wants to merge 19 commits into
base: master
Choose a base branch
from

Conversation

MortalHappiness
Copy link
Member

@MortalHappiness MortalHappiness commented Nov 18, 2024

Why are these changes needed?

When Redis is configured with an idle timeout, a connection that remains idle for too long will be closed by the server. Previously, we used fixed connections to communicate with the server, specifically the sync context and async context.

This PR implements reconnection to resolve this issue.

Additionally, for error replies, the async context already implements exponential retry, whereas the sync context does not. This PR also adds exponential retry for the sync context using the same connection.

Related issue number

Closes #47419

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@MortalHappiness MortalHappiness force-pushed the bugfix/#47419-gcs-server-redis-timeout branch 3 times, most recently from 5f0ee63 to 5f49408 Compare November 22, 2024 17:31
@MortalHappiness MortalHappiness marked this pull request as ready for review November 22, 2024 17:33
@MortalHappiness MortalHappiness requested a review from a team as a code owner November 22, 2024 17:33
@MortalHappiness MortalHappiness added the go add ONLY when ready to merge, run all tests label Nov 22, 2024
@MortalHappiness MortalHappiness changed the title [Fix][GCS] GCS server crash if redis configures idle timeout [Fix][GCS] GCS server crash if redis idle timeout is configured Nov 22, 2024
@MortalHappiness MortalHappiness force-pushed the bugfix/#47419-gcs-server-redis-timeout branch from 5f49408 to bfee08d Compare November 22, 2024 18:44
src/ray/gcs/redis_context.cc Show resolved Hide resolved
src/ray/gcs/redis_context.cc Outdated Show resolved Hide resolved
src/ray/gcs/redis_context.cc Outdated Show resolved Hide resolved
Copy link
Collaborator

@jjyao jjyao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you update the PR description to describe why GCS crashes when redis idle timeout is configured and what's the fix

src/ray/gcs/redis_context.cc Outdated Show resolved Hide resolved
src/ray/gcs/redis_context.cc Show resolved Hide resolved
@MortalHappiness MortalHappiness force-pushed the bugfix/#47419-gcs-server-redis-timeout branch from bfee08d to 52eae7c Compare November 25, 2024 14:24
@MortalHappiness MortalHappiness marked this pull request as draft November 26, 2024 18:59
@MortalHappiness
Copy link
Member Author

MortalHappiness commented Nov 26, 2024

image

@jjyao I found it a bit complicated, so I want to discuss the design and implementation here. I drew a simple class diagram. As you can see, RedisClient composes RedisContext and RedisAsioClient. Upon connection, RedisAsioClient will replace the callback function of redisAsyncContext in RedisAsyncContext. redisAsyncContext implicitly uses RedisAsioClient via the callback function. The original implementation assumes that these objects will not be recreated.

However, I now need to implement reconnection. If a connection loss is detected in RedisContext, I need to recreate both RedisAsyncContext and redisContext. But there is a dependency cycle: the lifetimes of RedisAsyncContext and RedisAsioClient are linked. This means I also need to recreate RedisAsioClient, which effectively requires me to call the Connect function of RedisClient again (this is because what RedisClient::Connect does is to call RedisContext::Connect and create a RedisAsioClient object)

Here are the problems:

  1. There is a comment in RedisClient stating that Connect and Disconnect are not thread-safe. So it may not be a good idea to call it multiple times.

    /// Connect to Redis. Non-thread safe.
    /// Call this function before calling other functions.
    ///
    /// \param io_service The event loop for this client.
    /// This io_service must be single-threaded. Because `RedisAsioClient` is
    /// non-thread safe.
    /// \return Status
    Status Connect(instrumented_io_context &io_service);
    /// Disconnect with Redis. Non-thread safe.
    void Disconnect();
    )

    • I'm unsure whether this comment is accurate. Since all members in RedisClient share the same instrumented_io_context &io_service, they should be thread-safe because they operate within the same event loop. Please correct me if I am wrong.
  2. Disconnection is detected in RedisContext, but RedisClient composes RedisContext. Calling RedisClient::Connect from RedisContext would make RedisContext dependent on RedisClient, which I believe is poor design.

@MortalHappiness MortalHappiness force-pushed the bugfix/#47419-gcs-server-redis-timeout branch 2 times, most recently from 37aadc3 to 52c9ca4 Compare November 27, 2024 19:22
@MortalHappiness MortalHappiness force-pushed the bugfix/#47419-gcs-server-redis-timeout branch 2 times, most recently from f9e3422 to b6d2d83 Compare December 5, 2024 21:51
@MortalHappiness MortalHappiness changed the title [Fix][GCS] GCS server crash if redis idle timeout is configured [Fix][GCS] Implement reconnection for RedisContext Dec 5, 2024
@MortalHappiness MortalHappiness force-pushed the bugfix/#47419-gcs-server-redis-timeout branch from b6d2d83 to 0748886 Compare December 7, 2024 10:00
@MortalHappiness MortalHappiness force-pushed the bugfix/#47419-gcs-server-redis-timeout branch 3 times, most recently from 3576106 to 6a3e35c Compare December 8, 2024 05:07
@MortalHappiness MortalHappiness marked this pull request as ready for review December 8, 2024 11:03
@MortalHappiness MortalHappiness marked this pull request as ready for review December 13, 2024 11:35
src/ray/gcs/redis_context.cc Outdated Show resolved Hide resolved
src/ray/gcs/redis_context.cc Outdated Show resolved Hide resolved
src/ray/gcs/redis_context.cc Outdated Show resolved Hide resolved
username_ = username;
password_ = password;
enable_ssl_ = enable_ssl;
// Don't try to reconnect for the first time.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was used to prevent infinite reconnection during startup because the failure in the first connection is usually due to user misconfiguration of the Redis server, such as an incorrect password. However, in #48781 (comment), I've changed it to a fatal error in the Connect function if we fail to connect to the saved RAY_REDIS_ADDRESS. Therefore, this is no longer needed.

Status RedisContext::Reconnect() {
RAY_LOG(INFO) << "Try to reconnect to Redis server.";
Disconnect();
return Connect(address_, port_, username_, password_, enable_ssl_);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we going to have infinite reconnect if address_ is down?

Copy link
Member Author

@MortalHappiness MortalHappiness Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change it to a fatal error, such that the GCS server will crash if it fails to connect to the saved address RAY_REDIS_ADDRESS in the Connect function.

// If we failed to connect to the saved address RAY_REDIS_ADDRESS, then it's a fatal
// error.
RAY_CHECK_OK(ConnectToIPAddress(ip_address, port_));

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do check failure first which matches the current behavior. Later on we can add retry different ip (since DNS may return multiple ips and currently we only try the first one).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does "later on" here mean implementing it in this PR or in a separate PR?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separate PR.

@MortalHappiness MortalHappiness force-pushed the bugfix/#47419-gcs-server-redis-timeout branch 4 times, most recently from aef5a16 to f713214 Compare December 16, 2024 17:50
context_.reset();
redis_async_context_.reset();
ResetSyncContext();
ResetAsyncContext();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this can cause nullptr exception since there might be some RedisRequestContext that hold the raw pointer to redis async context?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved nullptr check for redis async context to RedisContext::async_context in this commit. RedisRequestContext always calls RedisContext::async_context to get the redis async context so this can prevent nullptr exception.

9779e4c

Copy link
Member Author

@MortalHappiness MortalHappiness Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I changed the deleter for redis async context to use redisAsyncDisconnect instead of redisAsyncFree. redisAsyncDisconnect calls redisAsyncFree, but it tries to execute callbacks for all remaining replies before freeing the context.

ec1e835

ref:

Closes: ray-project#47419
Signed-off-by: Chi-Sheng Liu <[email protected]>
Closes: ray-project#47419
Signed-off-by: Chi-Sheng Liu <[email protected]>
Closes: ray-project#47419
Signed-off-by: Chi-Sheng Liu <[email protected]>
Closes: ray-project#47419
Signed-off-by: Chi-Sheng Liu <[email protected]>
Closes: ray-project#47419
Signed-off-by: Chi-Sheng Liu <[email protected]>
Closes: ray-project#47419
Signed-off-by: Chi-Sheng Liu <[email protected]>
Closes: ray-project#47419
Signed-off-by: Chi-Sheng Liu <[email protected]>
@MortalHappiness MortalHappiness force-pushed the bugfix/#47419-gcs-server-redis-timeout branch from f713214 to 9779e4c Compare December 17, 2024 00:22
Copy link

stale bot commented Jan 22, 2025

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added stale The issue is stale. It will be closed within 7 days unless there are further conversation and removed stale The issue is stale. It will be closed within 7 days unless there are further conversation labels Jan 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Core]Ray head crashed silently - improve observability for redis timeouts causing said crash
5 participants