Ratelimit presence updates #18000
Conversation
Force-pushed from 9ee8357 to 9a48930
Force-pushed from 9a48930 to aed1b1e
I tested this PR on our production server (running v1.120.2) with 340 active users today. Here follows a test report; the patch appears to fix #16843 for us. It shows the changes in CPU usage patterns when enabling or disabling the patch during our peak usage times.

Test timeline

2024-12-05T11:20:27+0100 systemctl restart nginx.service
2024-12-05T12:15:44+0100 systemctl restart netdata.service
2024-12-05T12:19:04+0100 systemctl restart synapse-phys.target # 1 restart v1.120.2
2024-12-05T12:22:01+0100 mv env env_orig; mv env_patched env
2024-12-05T12:26:56+0100 vim /etc/opt/synapse/phys.ethz.ch.homeserver.yaml
2024-12-05T12:30:11+0100 mv env env_patched; mv env_orig env
2024-12-05T12:30:18+0100 systemctl restart synapse-phys.target # 2 restart v1.120.2 patched
2024-12-05T12:30:51+0100 vim /etc/opt/synapse/phys.ethz.ch.homeserver.yaml # set per_second: 0.02
2024-12-05T12:45:03+0100 systemctl restart synapse-phys.target # 3 restart v1.120.2
2024-12-05T12:55:56+0100 systemctl restart synapse-phys.target # 4 restart v1.120.2
2024-12-05T12:57:11+0100 mv env env_orig; mv env_patched env
2024-12-05T12:57:17+0100 systemctl restart synapse-phys.target # 5 restart v1.120.2 patched (0.02)
2024-12-05T13:05:00+0100 cd /etc/opt/synapse/
2024-12-05T13:05:06+0100 vim phys.ethz.ch.homeserver.yaml # set per_second: 0.002
2024-12-05T13:05:34+0100 systemctl restart synapse-phys.target # 6 restart v1.120.2 patched (0.002)
2024-12-05T13:52:17+0100 mv env env_patched; mv env_orig env
2024-12-05T13:52:20+0100 systemctl restart synapse-phys.target # 7 restart v1.120.2
2024-12-05T13:58:36+0100 vim /etc/opt/synapse/phys.ethz.ch.homeserver.yaml # set per_second: 0.01?
2024-12-05T13:59:22+0100 mv env env_orig; mv env_patched env
2024-12-05T13:59:25+0100 systemctl restart synapse-phys.target # 8 restart v1.120.2 patched (0.01?)
2024-12-05T14:05:43+0100 vim /etc/opt/synapse/phys.ethz.ch.homeserver.yaml # set ?
2024-12-05T14:17:19+0100 systemctl restart synapse-phys.target # 9 restart v1.120.2 patched (?)
2024-12-05T14:18:32+0100 vim env/lib/python3.11/site-packages/synapse/rest/client/sync.py # logger.info -> logger.debug (disabling the "User set_presence ratelimit exceeded; ignoring it." log spam, synapse log level was always set to info during these tests)
2024-12-05T14:19:09+0100 vim /etc/opt/synapse/phys.ethz.ch.homeserver.yaml # set per_second: 0.02?
2024-12-05T14:20:27+0100 systemctl restart synapse-phys.target # 10 restart v1.120.2 patched (0.02?)
2024-12-05T15:12:21+0100 mv env env_patched; mv env_orig env
2024-12-05T15:12:27+0100 systemctl restart synapse-phys.target # 11 restart v1.120.2
2024-12-05T15:13:41+0100 vim /etc/opt/synapse/phys.ethz.ch.homeserver.yaml # unset (use defaults: per_second: 0.1)
2024-12-05T15:18:38+0100 mv env env_orig; mv env_patched env
2024-12-05T15:18:41+0100 systemctl restart synapse-phys.target # 12 restart v1.120.2 patched (0.1)

Numbered screenshots of CPU usage patterns (images omitted)
After 2.5 work days the issue (#16843) did not reappear. The day with the red line was the day of testing #18000 (comment). After the red line the PR is live with v1.120.2 using the defaults (per_second: 0.1).
The implementation looks good to me! The only thing missing is unit tests (though thank you for the proof-in-production graphs!).
At minimum, could you write two tests:
- Send a presence update, check that it went through, immediately send another and check that it was ignored.
- Send a presence update, check that it went through, advance time by a sufficient amount, then send another presence update and check that it also worked.
Presence-related tests go in PresenceUpdateTestCase. Here is an example of advancing time in a test; time won't advance otherwise.
I'd recommend switching the presence state (offline -> online, etc.) so you can check that a change occurred.
Let me know or shout in #synapse-dev:matrix.org if you need further guidance!
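For illustration, here is a rough sketch of what those two tests could look like. This is not the code that ended up in the PR: the class name is made up, it assumes the standard HomeserverTestCase helpers (register_user, login, make_request, reactor.advance), assumes the new config key is rc_presence.per_user, and assumes a ratelimited update is silently ignored with a 200 rather than rejected with a 429.

```python
import synapse.rest.admin
from synapse.rest.client import login, presence
from synapse.types import UserID

from tests import unittest


class PresenceRatelimitSketch(unittest.HomeserverTestCase):
    """Illustrative sketch only; the real tests belong in PresenceUpdateTestCase."""

    servlets = [
        synapse.rest.admin.register_servlets_for_client_rest_resource,
        login.register_servlets,
        presence.register_servlets,
    ]

    # Assumed config shape: allow one presence update, then ratelimit.
    @unittest.override_config(
        {"rc_presence": {"per_user": {"per_second": 0.1, "burst_count": 1}}}
    )
    def test_presence_update_ratelimit(self) -> None:
        user_id = self.register_user("alice", "password")
        tok = self.login("alice", "password")

        def set_presence(state: str) -> None:
            channel = self.make_request(
                "PUT",
                "/presence/%s/status" % (user_id,),
                {"presence": state},
                access_token=tok,
            )
            self.assertEqual(channel.code, 200, channel.result)

        def stored_state() -> str:
            handler = self.hs.get_presence_handler()
            return self.get_success(
                handler.get_state(UserID.from_string(user_id))
            ).state

        # 1. The first update goes through and changes the stored state...
        set_presence("online")
        self.assertEqual(stored_state(), "online")

        # ...an immediate second update is ignored: the state does not change.
        set_presence("unavailable")
        self.assertEqual(stored_state(), "online")

        # 2. Advance the fake reactor clock past the ratelimit window; the next
        #    update is applied again. Time does not advance on its own in tests.
        self.reactor.advance(60)
        set_presence("unavailable")
        self.assertEqual(stored_state(), "unavailable")
```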
1. Send a presence update, check that it went through, immediately send another one and check that it was ignored.
2. Send a presence update, check that it went through, advance time by a sufficient amount, then send another presence update and check that it also worked.
@anoadragon453 I changed the docs as requested and implemented the unit tests. I hope that looks good.
Thanks for continuing to update this! A couple of things below.
@anoadragon453 I implemented the requested changes. I think the unit tests for the ...
So that we don't hit ratelimits while testing the logic of presence in our Complement test suite.
This looks good to me; thanks @rda0!
@rda0 looks like there are failures for the following tests:
presumably as we're now receiving 429s in spots we didn't expect before.
@anoadragon453 I was not sure about these two errors. I don't think it's 429s; the tests time out:

$ poetry run trial -j8 tests.rest.client.test_sync.SyncCacheTestCase.test_noop_sync_does_not_tightloop tests.rest.client.test_sync.DeviceListSyncTestCase_0__sync.test_not_receiving_local_device_list_changes
The "poetry.dev-dependencies" section is deprecated and will be removed in a future version. Use "poetry.group.dev.dependencies" instead.
Running 2 tests.
tests.rest.client.test_sync
DeviceListSyncTestCase_0__sync
test_not_receiving_local_device_list_changes ... [ERROR]
SyncCacheTestCase
test_noop_sync_does_not_tightloop ... [ERROR]
===============================================================================
[ERROR]
Traceback (most recent call last):
File "/home/matrix/synapse/tests/rest/client/test_sync.py", line 855, in test_not_receiving_local_device_list_changes
bob_sync_channel.await_result()
File "/home/matrix/synapse/tests/server.py", line 313, in await_result
raise TimedOutException("Timed out waiting for request to finish.")
tests.server.TimedOutException: Timed out waiting for request to finish.
tests.rest.client.test_sync.DeviceListSyncTestCase_0__sync.test_not_receiving_local_device_list_changes
===============================================================================
[ERROR]
Traceback (most recent call last):
File "/home/matrix/synapse/tests/rest/client/test_sync.py", line 685, in test_noop_sync_does_not_tightloop
channel.await_result(timeout_ms=200)
File "/home/matrix/synapse/tests/server.py", line 313, in await_result
raise TimedOutException("Timed out waiting for request to finish.")
tests.server.TimedOutException: Timed out waiting for request to finish.
tests.rest.client.test_sync.SyncCacheTestCase.test_noop_sync_does_not_tightloop
-------------------------------------------------------------------------------
Ran 2 tests in 0.005s

When I change synapse/api/ratelimiting.py line 321 (at 0a31cf1) to await self.clock.sleep(0.1), both report OK; or when I change tests/rest/client/test_sync.py line 685 (at 0a31cf1) to channel.await_result(timeout_ms=600), the test test_noop_sync_does_not_tightloop reports OK.

I am unsure what to do here, or how that is supposed to work. Could you check that? It looks like the sleep in the rate limiter introduced in 52af16c now delays the /sync request (synapse/rest/client/sync.py line 129 at 0a31cf1) when set_presence is hitting the rate limit. I was not aware of that.
I now patched our production server again with the following patch, to see if it makes a difference for our issue in #18000 (comment).

diff --git a/synapse/api/ratelimiting.py b/synapse/api/ratelimiting.py
index b80630c5d3..229329a5ae 100644
--- a/synapse/api/ratelimiting.py
+++ b/synapse/api/ratelimiting.py
@@ -275,6 +275,7 @@ class Ratelimiter:
update: bool = True,
n_actions: int = 1,
_time_now_s: Optional[float] = None,
+ pause: Optional[float] = 0.5,
) -> None:
"""Checks if an action can be performed. If not, raises a LimitExceededError
@@ -298,6 +299,8 @@ class Ratelimiter:
at all.
_time_now_s: The current time. Optional, defaults to the current time according
to self.clock. Only used by tests.
+ pause: Time in seconds to pause when an action is being limited. Defaults to 0.5
+ to stop clients from "tight-looping" on retrying their request.
Raises:
LimitExceededError: If an action could not be performed, along with the time in
@@ -316,9 +319,8 @@ class Ratelimiter:
)
if not allowed:
- # We pause for a bit here to stop clients from "tight-looping" on
- # retrying their request.
- await self.clock.sleep(0.5)
+ if pause:
+ await self.clock.sleep(pause)
raise LimitExceededError(
limiter_name=self._limiter_name,
diff --git a/synapse/rest/client/sync.py b/synapse/rest/client/sync.py
index b935830bae..e6742e8909 100644
--- a/synapse/rest/client/sync.py
+++ b/synapse/rest/client/sync.py
@@ -247,9 +247,9 @@ class SyncRestServlet(RestServlet):
# send any outstanding server notices to the user.
await self._server_notices_sender.on_user_syncing(user.to_string())
- # ignore the presence update if the ratelimit is exceeded
+ # ignore the presence update if the ratelimit is exceeded but do not pause the request
try:
- await self._presence_per_user_limiter.ratelimit(requester)
+ await self._presence_per_user_limiter.ratelimit(requester, pause=0)
except LimitExceededError:
affect_presence = False
logger.debug("User set_presence ratelimit exceeded; ignoring it.")
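For readers following along, here is a small self-contained sketch of the idea behind this patch: making the anti-tight-loop pause optional per call site. It is not Synapse's actual Ratelimiter class (which uses the homeserver clock, per-requester keys and LimitExceededError); all names and numbers below are illustrative.

```python
import asyncio
import time
from typing import Dict, Optional, Tuple


class TinyRatelimiter:
    """Simplified stand-in for synapse.api.ratelimiting.Ratelimiter."""

    def __init__(self, per_second: float, burst_count: float) -> None:
        self.per_second = per_second
        self.burst_count = burst_count
        # key -> (pending action count, time of last update)
        self._actions: Dict[str, Tuple[float, float]] = {}

    async def ratelimit(self, key: str, pause: Optional[float] = 0.5) -> None:
        now = time.monotonic()
        count, last = self._actions.get(key, (0.0, now))

        # Leak actions out of the bucket according to the allowed rate.
        count = max(0.0, count - (now - last) * self.per_second)

        if count >= self.burst_count:
            # Over the limit. Normally pause briefly so clients that retry in a
            # tight loop are slowed down; the /sync presence path would pass
            # pause=0 so that a ratelimited (and ignored) presence update does
            # not delay the whole sync request.
            if pause:
                await asyncio.sleep(pause)
            raise RuntimeError("ratelimit exceeded")  # stands in for LimitExceededError

        self._actions[key] = (count + 1.0, now)


async def demo() -> None:
    limiter = TinyRatelimiter(per_second=0.1, burst_count=1)
    await limiter.ratelimit("@alice:example.org")               # allowed
    try:
        await limiter.ratelimit("@alice:example.org", pause=0)  # denied, no pause
    except RuntimeError:
        pass  # the caller ignores the update instead of failing the request


if __name__ == "__main__":
    asyncio.run(demo())
```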
I forgot a field!
Force-pushed from 0010383 to 7564438
@rda0 apologies, I had missed that we typically set high ratelimits for our unit tests to avoid them affecting unrelated tests: https://github.com/rda0/synapse/blob/ab10c2c0b140ad30d7edde652fcdc9da2154f417/tests/utils.py#L203

I've added ... (I also corrected a typo in the Complement config - I had missed out the ...)
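For context, this is roughly the shape of such a test-wide default: a hedged sketch of what the shared default config in tests/utils.py might gain, assuming the new option is named rc_presence.per_user (the values here are illustrative; check the linked file for the real ones).

```python
# Sketch: give unit tests an effectively unlimited presence ratelimit so that
# tests which are not about ratelimiting never trip it (or its 0.5s pause).
default_config_additions = {
    "rc_presence": {
        "per_user": {"per_second": 10_000, "burst_count": 10_000},
    },
}
```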
No worries and thank you for your changes.
Does that mean I should revert 7423299?
Some other things:
I am still trying to test the change in #18000 (comment), but at the moment I am unable to reproduce the issue #16843 to verify that the PR still works as intended without the pause.
Yes, I think we can avoid that now.
The force push wasn't necessary, but no, that commit is still present:
Ah sorry, I didn't realise that's what you intended with your comment above. Yes, feel free to commit it - I agree with all your points. I'm currently investigating the remaining test failure.
@anoadragon453 I was able to reproduce #16843 and verify that the PR still works as expected with 521b553. To reproduce, just alternate ...

The commit 7423299 is reverted.
All tests pass locally when I run:
Everything looks good to me now. Let me know if there is anything else.
Otherwise they're using the default, very high ratelimit config for the unit tests.
This ratelimit is introduced in element-hq/synapse#18000 and causes some existing Sytests to fail. Increase the ratelimit in Sytest's Synapse config so tests pass again.
Sytests are failing because I forgot to bump the ratelimit there as well. This is now done in matrix-org/sytest#1394. We'll need to merge that PR before this one so that other PRs don't fail CI.
matrix-org/sytest#1394 has been merged and the CI here passed, hence merging this PR. Thanks again @rda0!
That went quickly 👏. Thank you too @anoadragon453
This adds per-user rate-limiting for presence updates.
This is my first contribution to Synapse. The default settings probably need guidance from the developers and should be tested on other server configurations. I will post a test report in the comments.