[Fix][GCS] Implement reconnection for RedisContext #48781
Could you update the PR description to explain why GCS crashes when a Redis idle timeout is configured, and what the fix is?
@jjyao I found it a bit complicated, so I want to discuss the design and implementation here. I drew a simple class diagram. As you can see, However, I now need to implement reconnection. If a connection loss is detected in Here are the problems:
src/ray/gcs/redis_context.cc
username_ = username;
password_ = password;
enable_ssl_ = enable_ssl;
// Don't try to reconnect for the first time.
why?
This was used to prevent infinite reconnection during startup, because a failure of the first connection is usually due to user misconfiguration of the Redis server, such as an incorrect password. However, in #48781 (comment), I've changed it to a fatal error in the `Connect` function if we fail to connect to the saved `RAY_REDIS_ADDRESS`. Therefore, this is no longer needed.
src/ray/gcs/redis_context.cc
Status RedisContext::Reconnect() {
  RAY_LOG(INFO) << "Try to reconnect to Redis server.";
  Disconnect();
  return Connect(address_, port_, username_, password_, enable_ssl_);
Are we going to reconnect infinitely if `address_` is down?
Changed it to a fatal error, so that the GCS server will crash if it fails to connect to the saved address `RAY_REDIS_ADDRESS` in the `Connect` function.
ray/src/ray/gcs/redis_context.cc
Lines 706 to 708 in 2db1939
// If we failed to connect to the saved address RAY_REDIS_ADDRESS, then it's a fatal
// error.
RAY_CHECK_OK(ConnectToIPAddress(ip_address, port_));
Let's do a check failure first, which matches the current behavior. Later on we can add retrying with a different IP (since DNS may return multiple IPs and currently we only try the first one).
Does "later on" here mean implementing it in this PR or in a separate PR?
Separate PR.
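The behavior agreed on in this thread (save the connection parameters, reconnect by disconnecting and connecting again, and fail loudly instead of retrying forever) can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `ReconnectingContext` and `ConnectFn` are hypothetical names, and throwing `std::runtime_error` is only a stand-in for the process-aborting `RAY_CHECK_OK`.

```cpp
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for a single Redis connection attempt.
// Returns true on success.
using ConnectFn = std::function<bool(const std::string &address, int port)>;

class ReconnectingContext {
 public:
  explicit ReconnectingContext(ConnectFn connect_fn)
      : connect_fn_(std::move(connect_fn)) {}

  // Saves the parameters so that Reconnect() can reuse them later.
  void Connect(const std::string &address, int port) {
    address_ = address;
    port_ = port;
    connected_ = false;
    // Stand-in for RAY_CHECK_OK: a failed attempt is treated as fatal
    // rather than retried, which rules out an infinite reconnect loop.
    if (!connect_fn_(address_, port_)) {
      throw std::runtime_error("failed to connect to " + address_);
    }
    connected_ = true;
  }

  void Disconnect() { connected_ = false; }

  // Drops the current connection and connects again with the saved parameters.
  void Reconnect() {
    Disconnect();
    Connect(address_, port_);
  }

  bool connected() const { return connected_; }

 private:
  ConnectFn connect_fn_;
  std::string address_;
  int port_ = 0;
  bool connected_ = false;
};
```

Because `Reconnect()` goes through the same `Connect()` path, the "fatal on failure" rule applies to both the first connection and every later reconnection, which is why a separate "skip the first reconnect" flag is not needed in this sketch.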
context_.reset();
redis_async_context_.reset();
ResetSyncContext();
ResetAsyncContext();
Actually, this can cause a nullptr exception, since there might be some `RedisRequestContext` that holds the raw pointer to the Redis async context?
Moved the nullptr check for the Redis async context to `RedisContext::async_context` in this commit. `RedisRequestContext` always calls `RedisContext::async_context` to get the Redis async context, so this can prevent the nullptr exception.
Also, I changed the deleter for the Redis async context to use `redisAsyncDisconnect` instead of `redisAsyncFree`. `redisAsyncDisconnect` calls `redisAsyncFree`, but it tries to execute the callbacks for all remaining replies before freeing the context.
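The two ideas in the replies above (a single null-checked accessor, and a deleter that disconnects rather than frees) can be illustrated with a small self-contained sketch. Everything here is a stand-in: `FakeAsyncContext`, `FakeAsyncFree`, and `FakeAsyncDisconnect` only mimic the behavior described in the comment and are not the hiredis functions, and `ContextHolder` is a hypothetical simplification of `RedisContext` that throws where the real code would more likely use a check macro.

```cpp
#include <functional>
#include <memory>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for an async Redis context with queued reply callbacks.
struct FakeAsyncContext {
  std::vector<std::function<void()>> pending_callbacks;
};

// Stand-in for a plain "free": releases the context without running callbacks.
inline void FakeAsyncFree(FakeAsyncContext *ctx) { delete ctx; }

// Stand-in for a "disconnect" as described in the comment above:
// run the remaining callbacks first, then free the context.
inline void FakeAsyncDisconnect(FakeAsyncContext *ctx) {
  for (auto &callback : ctx->pending_callbacks) {
    callback();
  }
  FakeAsyncFree(ctx);
}

// Custom deleter so that resetting the unique_ptr disconnects instead of freeing.
struct AsyncDisconnectDeleter {
  void operator()(FakeAsyncContext *ctx) const { FakeAsyncDisconnect(ctx); }
};

class ContextHolder {
 public:
  // Replaces the current context (if any); the old one goes through the deleter.
  void ResetAsyncContext(FakeAsyncContext *ctx = nullptr) {
    async_context_.reset(ctx);
  }

  // Single access point: callers fetch the context here on every use instead
  // of caching a raw pointer, so one null check covers all of them.
  FakeAsyncContext &async_context() {
    if (!async_context_) {
      throw std::logic_error("async context is not connected");
    }
    return *async_context_;
  }

 private:
  std::unique_ptr<FakeAsyncContext, AsyncDisconnectDeleter> async_context_;
};
```

The design choice being illustrated is that pending callbacks still get a chance to run when the context is reset during a reconnect, and that a reset context is reported at the accessor rather than dereferenced somewhere downstream.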
Closes: ray-project#47419 Signed-off-by: Chi-Sheng Liu <[email protected]>
…yncDisconnect Closes: ray-project#47419 Signed-off-by: Chi-Sheng Liu <[email protected]>
Why are these changes needed?
When Redis is configured with an idle timeout, a connection that remains idle for too long is closed by the server. Previously, GCS communicated with the server over two fixed connections, the sync context and the async context, so a connection closed this way was never re-established.
This PR implements reconnection to resolve this issue.
Additionally, for error replies, the async context already retries with exponential backoff, whereas the sync context does not. This PR also adds exponential-backoff retry for the sync context, reusing the same connection.
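As a rough illustration of the retry behavior described for the sync context, the following sketch retries an operation on the same connection and doubles the wait between attempts. The function name, the parameters, and the injectable `sleep_fn` are all assumptions made for this example; the actual attempt limit and delays used in the PR are not stated here.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Runs `op` until it succeeds or `max_attempts` is reached, doubling the wait
// between attempts. Returns true if some attempt succeeded.
// `sleep_fn` is injectable so that callers (and tests) can avoid real sleeping.
inline bool RetryWithExponentialBackoff(
    const std::function<bool()> &op,
    int max_attempts,
    std::chrono::milliseconds initial_delay,
    const std::function<void(std::chrono::milliseconds)> &sleep_fn =
        [](std::chrono::milliseconds d) { std::this_thread::sleep_for(d); }) {
  std::chrono::milliseconds delay = initial_delay;
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    if (op()) {
      return true;
    }
    // No need to wait after the final failed attempt.
    if (attempt < max_attempts) {
      sleep_fn(delay);
      delay *= 2;
    }
  }
  return false;
}
```

With an initial delay of 10 ms and an operation that succeeds on its fourth call, this waits 10 ms, 20 ms, and 40 ms between attempts. Retrying like this only helps with error replies on a live connection; a closed connection still needs the reconnection path.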
Related issue number
Closes #47419