-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core] [1/N] Ray syncer set observer #49122
[core] [1/N] Ray syncer set observer #49122
Conversation
137a067
to
637ff45
Compare
Signed-off-by: hjiang <[email protected]>
637ff45
to
5f88358
Compare
Hi @jjyao , could you please help review this PR? Thank you! |
src/ray/common/ray_syncer/common.h
Outdated
using RaySyncMsgObserver = | ||
std::function<void(const ::ray::rpc::syncer::RaySyncMessage &)>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we instead define a generic
// A callback is called whenver a rpc completes between the current process and the given raylet regardless of whether the response indicates success or failure
using RayletCompletedRpcCallback(void(const NodeID &))
and this callback can be set on any components that have rpcs with raylet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
regardless of whether the response indicates success or failure
I'm not sure this is correct. For example, if you get an error status on grpc (i.e. UNAVAILABLE for grpc status), it actually means communication between gcs and raylet is broken, which we shouldn't record completion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here Completed
means client sends the request and then receives the response
so if this is UNAVAILABLE, it's not completed. This is the naming suggested by chatgpt lol.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or we can think Completed
== GrpcStatus::OK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But feel free to give a better name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed Raylet
from the comment and type alias, the whole RaySyncer
is pretty general-purpose, which doesn't couple with any component in terms of implementation.
Signed-off-by: hjiang <[email protected]>
if (on_raylet_rpc_completion_) { | ||
on_raylet_rpc_completion_(NodeID::FromBinary(message->node_id())); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the best place to trigger the callback is RaySyncerBidiReactorBase::OnReadDone()
because NodeState::ConsumeSyncMessage
can be called in different settings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, I think it's good to update right after a successful grpc call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I recall why I put it in the NodeState
at the first place, reactors are stored per-node as a map inside of RaySyncer
:
ray/src/ray/common/ray_syncer/ray_syncer.h
Lines 162 to 166 in 2e4a126
/// Manage connections. Here the key is the NodeID in binary form. | |
absl::flat_hash_map<std::string, RaySyncerBidiReactor *> sync_reactors_; | |
/// The local node state | |
std::unique_ptr<NodeState> node_state_; |
which means if you tie the logic grpc callback, you have to store it somewhere and pass it to every reactor construction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementation updated.
Signed-off-by: hjiang <[email protected]>
Signed-off-by: hjiang <[email protected]>
4c8bb2c
to
9a9fd9e
Compare
Signed-off-by: hjiang <[email protected]>
Signed-off-by: hjiang <[email protected]>
Signed-off-by: hjiang <[email protected]>
Signed-off-by: hjiang <[email protected]>
Signed-off-by: hjiang <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LG
@@ -93,11 +94,24 @@ class RaySyncerBidiReactor { | |||
} | |||
}; | |||
|
|||
/// Set rpc completion callback, which is called after rpc read finishes. | |||
/// This function is expected to call only once, repeated invocations throws exception. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// This function is expected to call only once, repeated invocations throws exception. | |
/// This function is expected to call only once, repeated invocations will check fail. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why "will check fail" is a better wording?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Anyway I updated the wording, but would like to hear the reasoning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check fail will just exit the process so no c++ exception is thrown?
Signed-off-by: dentiny <[email protected]>
Signed-off-by: dentiny <[email protected]>
Signed-off-by: hjiang <[email protected]> Signed-off-by: ujjawal-khare <[email protected]>
Resolves issue: #48837
This PR is the first part for reducing communication between GCS/raylet.
The functionality is:
<node id, updated timestamp>
which is stored in health check manager (in next PR)