-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix disappearing HTTP requests with the push/pull pipeline pattern in ZeroMQ #134
Fix disappearing HTTP requests with the push/pull pipeline pattern in ZeroMQ #134
Conversation
@@ -158,9 +158,6 @@ server_t<request_container_t, request_info_t>::server_t( | |||
proxy.setsockopt(ZMQ_SNDHWM, &disabled, sizeof(disabled)); | |||
proxy.connect(proxy_endpoint.c_str()); | |||
|
|||
// TODO: consider making this into a pull socket so we dont lose any results due to timing | |||
loopback.setsockopt(ZMQ_RCVHWM, &disabled, sizeof(disabled)); | |||
loopback.setsockopt(ZMQ_SUBSCRIBE, "", 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to set any options for push/pull sockets?
to be completely honest i never much cared about this problem. my tactic has always been to start processes in such as way as to avoid this issue cropping up. in practice the http process never crashes (ive not seen it) so as long as its up before workers start completing work you dont miss anything. this works pretty nicely in eg kubernetes and other similar systems in that we have added a status endpoint that a system should wait until it returns 200 before sending live traffic to the http server. since the workers and server have to be up and running to reply to /status you wont miss any responses that you care about if you dont send traffic until that endpoint returns 200. anyway im not opposed to this if we are sure there aren't any drawbacks. the one you noted about not being able to recv from any random person publishing to the loopback is not something we need to worry about, i cant see how that would be a useful feature. however its been a long time since i read all the zmq documentation and i feel like i should familiarize myself again before we change the socket type. truthfully i wish i could remember why i didnt just use a push pull configuration from the beginning and opted to put that TODO in there about trying it. it was just too long ago to pull out of my memory.. |
@kevinkreiser Thank you for your insights! I'm gonna see if someone at @mapbox can help reviewing this PR. |
i'll review it, no one at mapbox has push rights to this repo anyway |
src/prime_server.cpp
Outdated
@@ -482,7 +479,6 @@ worker_t::worker_t(zmq::context_t& context, | |||
downstream_proxy.setsockopt(ZMQ_SNDHWM, &disabled, sizeof(disabled)); | |||
downstream_proxy.connect(downstream_proxy_endpoint.c_str()); | |||
|
|||
loopback.setsockopt(ZMQ_SNDHWM, &disabled, sizeof(disabled)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should revert this in keeping with the rest of the design of the project. its better to let the application max out ram than to arbitrarily limit to 1k or whatever the default is
@@ -158,9 +158,6 @@ server_t<request_container_t, request_info_t>::server_t( | |||
proxy.setsockopt(ZMQ_SNDHWM, &disabled, sizeof(disabled)); | |||
proxy.connect(proxy_endpoint.c_str()); | |||
|
|||
// TODO: consider making this into a pull socket so we dont lose any results due to timing | |||
loopback.setsockopt(ZMQ_RCVHWM, &disabled, sizeof(disabled)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please revert this line also for the same reason mentioned below
Thank you for your review! Should we do the same for |
definitely not. the interrupt socket is used to cancel work when the client closes the connection before receiving a response. to cancel the work the server then needs to use the interupt socket to signal any workers to give up on the work if they are indeed working on the job that has been canceled. if you switched it from pub/sub (server/worker) to push/pull then only one of the workers would get the signal and we cannot gaurantee which worker it will be. maybe worker A is working on the request that is to be canceled but worker B is the one who gets the interrupt signal. in this case worker B would just ignore it because its not working on that job at the moment. in otherwords we need the broadcast functionality of the pub/sub relationship for this feature to work proprly. |
macos has been broken for a little while and i wasnt able to fix it so ignoring that and merging anyway |
Thank you so much for the maintenance work and detailed explanation! 🥳 |
This PR prevents HTTP requests to be absorbed into
http_server_t
by replacing pub/sub sockets with push/pull ones for HTTP server loopback on, for example, the initial launch of microservices built withprime_server
.This change should also allow us to remove the following fast-fail directives used in flaky tests in Valhalla. At least, I've never seen these tests' random failures after this change on my local machine and Circle CI.
https://github.com/valhalla/valhalla/blob/9725261d1c33d6fa361d3d0eec12404f79ca150d/test/loki_service.cc#L611
https://github.com/valhalla/valhalla/blob/9725261d1c33d6fa361d3d0eec12404f79ca150d/test/skadi_service.cc#L245
Questions
I'm not familiar with all the use cases of
prime_server
. In my understanding, now that we use a different type of sockets for HTTP request/response loopback, we can't send messages from random clients but only from workers with loopabck sockets properly configured to the HTTP server's loopback socket. Is that fine?