Fix disappearing HTTP requests with the push/pull pipeline pattern in ZeroMQ #134

Merged

Conversation

raviqqe (Contributor) commented Jul 16, 2024

This PR prevents HTTP requests from being silently absorbed by http_server_t by replacing the pub/sub sockets used for the HTTP server loopback with push/pull ones. This happens, for example, on the initial launch of microservices built with prime_server.

This change should also allow us to remove the following fast-fail directives used in flaky tests in Valhalla. At least, I've never seen these tests fail randomly after this change, either on my local machine or on CircleCI.

https://github.com/valhalla/valhalla/blob/9725261d1c33d6fa361d3d0eec12404f79ca150d/test/loki_service.cc#L611

https://github.com/valhalla/valhalla/blob/9725261d1c33d6fa361d3d0eec12404f79ca150d/test/skadi_service.cc#L245

Questions

I'm not familiar with all the use cases of prime_server. In my understanding, now that we use a different socket type for the HTTP request/response loopback, messages can no longer be sent from arbitrary clients, only from workers whose loopback sockets are properly connected to the HTTP server's loopback socket. Is that fine?
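
For context, here is a minimal standalone sketch (not code from this PR; the inproc endpoint names are invented) of why a pub/sub loopback can lose messages while push/pull does not: a PUB socket silently drops anything sent while no subscriber is attached (the slow-joiner problem), whereas a PUSH socket queues or blocks instead of discarding.

```cpp
// Minimal sketch, not prime_server code: why pub/sub can lose early loopback
// messages while push/pull does not. Endpoint names here are made up.
#include <zmq.h>
#include <cassert>

int main() {
  void* ctx = zmq_ctx_new();

  // PUB drops messages while no subscriber is attached (slow-joiner problem).
  void* pub = zmq_socket(ctx, ZMQ_PUB);
  zmq_bind(pub, "inproc://loopback_pubsub");
  zmq_send(pub, "early result", 12, 0);  // no subscriber yet -> silently lost

  // PUSH queues the message until the PULL end reads it, so nothing is lost.
  void* pull = zmq_socket(ctx, ZMQ_PULL);
  zmq_bind(pull, "inproc://loopback_pushpull");
  void* push = zmq_socket(ctx, ZMQ_PUSH);
  zmq_connect(push, "inproc://loopback_pushpull");
  zmq_send(push, "early result", 12, 0);

  char buf[32];
  int n = zmq_recv(pull, buf, sizeof(buf), 0);  // receives "early result"
  assert(n == 12);

  zmq_close(pub);
  zmq_close(push);
  zmq_close(pull);
  zmq_ctx_term(ctx);
  return 0;
}
```

The sockets in the diff below are created through the project's C++ wrapper rather than the raw C API, but the underlying socket semantics are the same.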

@@ -158,9 +158,6 @@ server_t<request_container_t, request_info_t>::server_t(
proxy.setsockopt(ZMQ_SNDHWM, &disabled, sizeof(disabled));
proxy.connect(proxy_endpoint.c_str());

// TODO: consider making this into a pull socket so we dont lose any results due to timing
loopback.setsockopt(ZMQ_RCVHWM, &disabled, sizeof(disabled));
loopback.setsockopt(ZMQ_SUBSCRIBE, "", 0);
raviqqe (Contributor, Author)

Do we want to set any options for push/pull sockets?

raviqqe marked this pull request as ready for review July 16, 2024 08:45
kevinkreiser self-requested a review July 16, 2024 15:14
kevinkreiser (Owner)

To be completely honest, I never much cared about this problem. My tactic has always been to start processes in such a way as to avoid this issue cropping up. In practice the HTTP process never crashes (I've not seen it), so as long as it's up before workers start completing work, you don't miss anything. This works pretty nicely in, e.g., Kubernetes and other similar systems, in that we have added a status endpoint that a system should wait on until it returns 200 before sending live traffic to the HTTP server. Since the workers and server have to be up and running to reply to /status, you won't miss any responses that you care about if you don't send traffic until that endpoint returns 200.

Anyway, I'm not opposed to this if we are sure there aren't any drawbacks. The one you noted, not being able to receive from any random client publishing to the loopback, is not something we need to worry about; I can't see how that would be a useful feature. However, it's been a long time since I read all the ZMQ documentation, and I feel like I should familiarize myself again before we change the socket type. Truthfully, I wish I could remember why I didn't just use a push/pull configuration from the beginning and opted to put that TODO in there about trying it; it was just too long ago to pull out of my memory.

raviqqe (Contributor, Author) commented Jul 29, 2024

@kevinkreiser Thank you for your insights! I'm going to see if someone at @mapbox can help review this PR.

kevinkreiser (Owner) commented Jul 29, 2024

I'll review it; no one at mapbox has push rights to this repo anyway.

@@ -482,7 +479,6 @@ worker_t::worker_t(zmq::context_t& context,
downstream_proxy.setsockopt(ZMQ_SNDHWM, &disabled, sizeof(disabled));
downstream_proxy.connect(downstream_proxy_endpoint.c_str());

loopback.setsockopt(ZMQ_SNDHWM, &disabled, sizeof(disabled));
kevinkreiser (Owner)

We should revert this, in keeping with the rest of the design of the project. It's better to let the application max out RAM than to arbitrarily limit it to 1k or whatever the default is.
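
For clarity, a sketch of what keeping these lines does, assuming `disabled` is the integer 0 (a ZeroMQ high-water mark of 0 means "no limit"); the function and parameter names below are invented, and the raw libzmq C API is used rather than the project's wrapper.

```cpp
// Illustrative sketch: a high-water mark of 0 disables the (default
// 1000-message) queue limit, so the loopback buffers in memory without a cap.
#include <zmq.h>

void disable_loopback_hwm(void* server_pull_loopback, void* worker_push_loopback) {
  int disabled = 0;  // 0 -> unlimited in-memory queueing
  // server side: receive end of the loopback
  zmq_setsockopt(server_pull_loopback, ZMQ_RCVHWM, &disabled, sizeof(disabled));
  // worker side: send end of the loopback
  zmq_setsockopt(worker_push_loopback, ZMQ_SNDHWM, &disabled, sizeof(disabled));
}
```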

@@ -158,9 +158,6 @@ server_t<request_container_t, request_info_t>::server_t(
proxy.setsockopt(ZMQ_SNDHWM, &disabled, sizeof(disabled));
proxy.connect(proxy_endpoint.c_str());

// TODO: consider making this into a pull socket so we dont lose any results due to timing
loopback.setsockopt(ZMQ_RCVHWM, &disabled, sizeof(disabled));
kevinkreiser (Owner)

Please revert this line also, for the same reason mentioned below.

raviqqe (Contributor, Author) commented Jul 30, 2024

Thank you for your review! Should we do the same for interrupt too?

kevinkreiser (Owner)

> Thank you for your review! Should we do the same for interrupt too?

Definitely not. The interrupt socket is used to cancel work when the client closes the connection before receiving a response. To cancel the work, the server needs to use the interrupt socket to signal any workers to give up on the work if they are indeed working on the job that has been canceled. If you switched it from pub/sub (server/worker) to push/pull, then only one of the workers would get the signal, and we cannot guarantee which worker it will be. Maybe worker A is working on the request that is to be canceled, but worker B is the one who gets the interrupt signal; in this case worker B would just ignore it because it's not working on that job at the moment. In other words, we need the broadcast functionality of the pub/sub relationship for this feature to work properly.
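
To illustrate the distinction (a standalone sketch, not prime_server code; the endpoint and message names are invented): a PUB socket delivers every message to all connected subscribers, while a PUSH socket hands each message to exactly one puller, so a cancellation broadcast only works over pub/sub.

```cpp
// Sketch: pub/sub fans out to every worker, which is what cancellation needs;
// push/pull would deliver the signal to only one (arbitrary) worker.
#include <zmq.h>
#include <chrono>
#include <thread>

int main() {
  void* ctx = zmq_ctx_new();

  void* interrupt_pub = zmq_socket(ctx, ZMQ_PUB);
  zmq_bind(interrupt_pub, "inproc://interrupt_demo");

  // two workers, each subscribed to every interrupt message
  void* worker_a = zmq_socket(ctx, ZMQ_SUB);
  zmq_setsockopt(worker_a, ZMQ_SUBSCRIBE, "", 0);
  zmq_connect(worker_a, "inproc://interrupt_demo");
  void* worker_b = zmq_socket(ctx, ZMQ_SUB);
  zmq_setsockopt(worker_b, ZMQ_SUBSCRIBE, "", 0);
  zmq_connect(worker_b, "inproc://interrupt_demo");

  // give the subscriptions a moment to propagate (the slow-joiner caveat again)
  std::this_thread::sleep_for(std::chrono::milliseconds(100));

  zmq_send(interrupt_pub, "cancel job 42", 13, 0);

  char buf[32];
  // both workers see the cancellation; each checks whether it is currently
  // working on job 42 and gives up only if so
  zmq_recv(worker_a, buf, sizeof(buf), 0);
  zmq_recv(worker_b, buf, sizeof(buf), 0);

  zmq_close(worker_a);
  zmq_close(worker_b);
  zmq_close(interrupt_pub);
  zmq_ctx_term(ctx);
  return 0;
}
```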

kevinkreiser (Owner)

macOS has been broken for a little while and I wasn't able to fix it, so I'm ignoring that and merging anyway.

kevinkreiser merged commit 4508553 into kevinkreiser:master Jul 30, 2024
1 of 2 checks passed
raviqqe (Contributor, Author) commented Jul 31, 2024

Thank you so much for the maintenance work and detailed explanation! 🥳
