
Ratelimit presence updates #18000

Merged — 15 commits merged into element-hq:develop on Jan 24, 2025
Conversation

@rda0 (Contributor) commented Dec 5, 2024

This adds per-user rate-limiting for presence updates.

This is my first contribution to Synapse. The default settings probably need guidance from the developers and should be tested on other server configurations. I will post a test report in the comments.

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Code style is correct
    (run the linters)
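As context for reviewers: the config knob ends up looking roughly like this in homeserver.yaml (final name per the review discussion below: rc_presence with a per_user sub-key; per_second: 0.1 is the default referenced in the testing comments, while burst_count is an assumed value):

```yaml
# Sketch of the new option, not the authoritative documentation.
rc_presence:
  per_user:
    per_second: 0.1   # default used in the tests below
    burst_count: 1    # assumed value, see the merged docs for the real default
```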

@rda0 rda0 requested a review from a team as a code owner December 5, 2024 18:38
@rda0 rda0 force-pushed the set-presence-ratelimit branch 2 times, most recently from 9ee8357 to 9a48930 on December 5, 2024 18:56
@rda0 rda0 force-pushed the set-presence-ratelimit branch from 9a48930 to aed1b1e on December 5, 2024 18:57
@rda0 (Contributor, Author) commented Dec 5, 2024

I tested this PR today on our production server (running v1.120.2) with 340 active users. Below is a test report; the patch appears to fix #16843 for us. It shows the changes in CPU usage patterns when enabling or disabling the patch during our peak usage times.

Test timeline

2024-12-05T11:20:27+0100  systemctl restart nginx.service
2024-12-05T12:15:44+0100  systemctl restart netdata.service
2024-12-05T12:19:04+0100  systemctl restart synapse-phys.target                # 1 restart v1.120.2
2024-12-05T12:22:01+0100  mv env env_orig; mv env_patched env
2024-12-05T12:26:56+0100  vim /etc/opt/synapse/phys.ethz.ch.homeserver.yaml
2024-12-05T12:30:11+0100  mv env env_patched; mv env_orig env
2024-12-05T12:30:18+0100  systemctl restart synapse-phys.target                # 2 restart v1.120.2 patched
2024-12-05T12:30:51+0100  vim /etc/opt/synapse/phys.ethz.ch.homeserver.yaml    # set per_second: 0.02
2024-12-05T12:45:03+0100  systemctl restart synapse-phys.target                # 3 restart v1.120.2
2024-12-05T12:55:56+0100  systemctl restart synapse-phys.target                # 4 restart v1.120.2
2024-12-05T12:57:11+0100  mv env env_orig; mv env_patched env
2024-12-05T12:57:17+0100  systemctl restart synapse-phys.target                # 5 restart v1.120.2 patched (0.02)
2024-12-05T13:05:00+0100  cd /etc/opt/synapse/
2024-12-05T13:05:06+0100  vim phys.ethz.ch.homeserver.yaml                     # set per_second: 0.002
2024-12-05T13:05:34+0100  systemctl restart synapse-phys.target                # 6 restart v1.120.2 patched (0.002)
2024-12-05T13:52:17+0100  mv env env_patched; mv env_orig env
2024-12-05T13:52:20+0100  systemctl restart synapse-phys.target                # 7 restart v1.120.2
2024-12-05T13:58:36+0100  vim /etc/opt/synapse/phys.ethz.ch.homeserver.yaml    # set per_second: 0.01?
2024-12-05T13:59:22+0100  mv env env_orig; mv env_patched env
2024-12-05T13:59:25+0100  systemctl restart synapse-phys.target                # 8 restart v1.120.2 patched (0.01?)
2024-12-05T14:05:43+0100  vim /etc/opt/synapse/phys.ethz.ch.homeserver.yaml    # set ?
2024-12-05T14:17:19+0100  systemctl restart synapse-phys.target                # 9 restart v1.120.2 patched (?)
2024-12-05T14:18:32+0100  vim env/lib/python3.11/site-packages/synapse/rest/client/sync.py    # logger.info -> logger.debug (disabling the "User set_presence ratelimit exceeded; ignoring it." log spam, synapse log level was always set to info during these tests)
2024-12-05T14:19:09+0100  vim /etc/opt/synapse/phys.ethz.ch.homeserver.yaml    # set per_second: 0.02?
2024-12-05T14:20:27+0100  systemctl restart synapse-phys.target                # 10 restart v1.120.2 patched (0.02?)
2024-12-05T15:12:21+0100  mv env env_patched; mv env_orig env
2024-12-05T15:12:27+0100  systemctl restart synapse-phys.target                # 11 restart v1.120.2
2024-12-05T15:13:41+0100  vim /etc/opt/synapse/phys.ethz.ch.homeserver.yaml    # unset (use defaults: per_second: 0.1)
2024-12-05T15:18:38+0100  mv env env_orig; mv env_patched env
2024-12-05T15:18:41+0100  systemctl restart synapse-phys.target                # 12 restart v1.120.2 patched (0.1)

Numbered screenshots of CPU usage patterns

[15 screenshots omitted, one per numbered restart above: 1-12, plus "repeating pattern" views for 6, 10 and 12. Screenshot 10 includes a change of the log statement from logger.info to logger.debug, suppressing the "User set_presence ratelimit exceeded; ignoring it." log spam; the Synapse log level was set to info throughout these tests.]

@rda0 (Contributor, Author) commented Dec 10, 2024

After 2.5 working days, the issue (#16843) has not reappeared:

[Grafana screenshot: phd-matrix synapse prometheus dashboard panel]

The day with the red line was the day of testing (#18000 (comment)). Since the red line, the PR has been live on v1.120.2 using the defaults (per_second: 0.1).

@anoadragon453 (Member) left a comment

The implementation looks good to me! The only thing missing is unit tests (though thank you for the proof-in-production graphs!).

At minimum, could you write two tests:

  1. Send a presence update, check that it went through, immediately send another and check that it was ignored.
  2. Send a presence update, check that it went through, advance time a sufficient amount, then send another presence update and check that it also worked.

Presence-related tests go in PresenceUpdateTestCase. Here is an example of advancing time in a test. Time won't advance otherwise.

I'd recommend switching the presence state (offline -> online, etc.) so you can check that a change occurred.

Let me know or shout in #synapse-dev:matrix.org if you need further guidance!
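The two requested tests boil down to a token-bucket check against a controllable clock. A minimal plain-Python sketch of the pattern (FakeClock and Limiter are illustrative stand-ins, not Synapse's test harness):

```python
class FakeClock:
    """Manually-advanced clock, standing in for a test reactor."""
    def __init__(self) -> None:
        self.now = 0.0

    def advance(self, seconds: float) -> None:
        self.now += seconds

class Limiter:
    """Allows `burst_count` actions, refilling at `per_second` tokens/s."""
    def __init__(self, clock: FakeClock, per_second: float, burst_count: int) -> None:
        self.clock = clock
        self.per_second = per_second
        self.burst_count = burst_count
        self.tokens = float(burst_count)
        self.last = clock.now

    def allow(self) -> bool:
        elapsed = self.clock.now - self.last
        self.last = self.clock.now
        self.tokens = min(self.burst_count, self.tokens + elapsed * self.per_second)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

clock = FakeClock()
limiter = Limiter(clock, per_second=0.1, burst_count=1)

# Test 1: first update goes through, an immediate second one is ignored.
assert limiter.allow() is True
assert limiter.allow() is False

# Test 2: after advancing time enough to refill a token, it works again.
clock.advance(10)  # 10 s * 0.1 tokens/s = 1 token
assert limiter.allow() is True
```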

(Two resolved review comments on docs/usage/configuration/config_documentation.md, now outdated.)
rda0 added 2 commits January 14, 2025 14:59
1. Send a presence update, check that it went through, immediately
   send another one and check that it was ignored.
2. Send a presence update, check that it went through, advancing
   time a sufficient amount, send another presence update and
   check that it also worked.
@rda0 (Contributor, Author) commented Jan 14, 2025

@anoadragon453 I changed the docs as requested and implemented the unit tests. I hope that looks good.

@anoadragon453 (Member) left a comment

Thanks for continuing to update this! A couple of things below.

(Review comments on tests/handlers/test_presence.py and synapse/rest/client/sync.py, resolved.)
@rda0 rda0 changed the title Ratelimit set_presence updates Ratelimit presence updates Jan 15, 2025
@rda0 (Contributor, Author) commented Jan 15, 2025

@anoadragon453 I implemented the requested changes. I think the unit tests for the /_matrix/client/v3/presence rate limit fit better in tests/rest/client/test_presence.py. I also renamed the config option from rc_set_presence to rc_presence, which makes more sense since it no longer affects only the set_presence parameter in sync.

@rda0 rda0 requested a review from anoadragon453 January 16, 2025 12:22
So that we don't get ratelimits while testing the logic of presence in our Complement test suite.
@anoadragon453 (Member) left a comment

This looks good to me; thanks @rda0!

(Resolved review comment on docs/usage/configuration/config_documentation.md, now outdated.)
@anoadragon453 anoadragon453 enabled auto-merge (squash) January 20, 2025 18:05
@anoadragon453 (Member) commented

@rda0 looks like there are failures for the following tests:

  • tests.rest.client.test_sync.SyncCacheTestCase.test_noop_sync_does_not_tightloop
  • tests.rest.client.test_sync.DeviceListSyncTestCase_0__sync.test_not_receiving_local_device_list_changes

presumably because we're now receiving 429s in spots we didn't expect before.

@rda0 (Contributor, Author) commented Jan 20, 2025

@anoadragon453 I was not sure about these two errors. I don't think they are 429s; the tests time out:

$ poetry run trial -j8 tests.rest.client.test_sync.SyncCacheTestCase.test_noop_sync_does_not_tightloop tests.rest.client.test_sync.DeviceListSyncTestCase_0__sync.test_not_receiving_local_device_list_changes
The "poetry.dev-dependencies" section is deprecated and will be removed in a future version. Use "poetry.group.dev.dependencies" instead.
Running 2 tests.
tests.rest.client.test_sync
  DeviceListSyncTestCase_0__sync
    test_not_receiving_local_device_list_changes ...                    [ERROR]
  SyncCacheTestCase
    test_noop_sync_does_not_tightloop ...                               [ERROR]

===============================================================================
[ERROR]
Traceback (most recent call last):
  File "/home/matrix/synapse/tests/rest/client/test_sync.py", line 855, in test_not_receiving_local_device_list_changes
    bob_sync_channel.await_result()
  File "/home/matrix/synapse/tests/server.py", line 313, in await_result
    raise TimedOutException("Timed out waiting for request to finish.")
tests.server.TimedOutException: Timed out waiting for request to finish.

tests.rest.client.test_sync.DeviceListSyncTestCase_0__sync.test_not_receiving_local_device_list_changes
===============================================================================
[ERROR]
Traceback (most recent call last):
  File "/home/matrix/synapse/tests/rest/client/test_sync.py", line 685, in test_noop_sync_does_not_tightloop
    channel.await_result(timeout_ms=200)
  File "/home/matrix/synapse/tests/server.py", line 313, in await_result
    raise TimedOutException("Timed out waiting for request to finish.")
tests.server.TimedOutException: Timed out waiting for request to finish.

tests.rest.client.test_sync.SyncCacheTestCase.test_noop_sync_does_not_tightloop
-------------------------------------------------------------------------------
Ran 2 tests in 0.005s

When I change

await self.clock.sleep(0.5)

to

await self.clock.sleep(0.1)

both tests report OK; or when I change

channel.await_result(timeout_ms=200)

to

channel.await_result(timeout_ms=600)

the test test_noop_sync_does_not_tightloop reports OK.

I am unsure what to do here, or how this is supposed to work. Could you check that?

It looks like the sleep in the rate limiter introduced in 52af16c now delays

async def on_GET(self, request: SynapseRequest) -> Tuple[int, JsonDict]:

requests when set_presence hits the rate limit. I was not aware of that.
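The interaction is easy to demonstrate in isolation: a handler that awaits a 0.5 s sleep on the rate-limited path can never finish within a 200 ms test budget. A minimal asyncio sketch (not Synapse code):

```python
import asyncio

async def handler(rate_limited: bool) -> str:
    # Mimics the limiter's behaviour before the fix: pause 0.5 s when
    # the action is rate-limited, then carry on with the request.
    if rate_limited:
        await asyncio.sleep(0.5)
    return "synced"

async def demo() -> tuple:
    # Unlimited request: completes well within a 200 ms budget.
    first = await asyncio.wait_for(handler(False), timeout=0.2)
    # Rate-limited request: the embedded 0.5 s sleep blows the budget,
    # just like channel.await_result(timeout_ms=200) in the failing test.
    try:
        second = await asyncio.wait_for(handler(True), timeout=0.2)
    except asyncio.TimeoutError:
        second = "timed out"
    return first, second

first, second = asyncio.run(demo())
print(first, "/", second)  # synced / timed out
```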

@rda0 (Contributor, Author) commented Jan 22, 2025

I have now patched our production server again with the following patch, to see if it makes a difference for our issue in #18000 (comment).

diff --git a/synapse/api/ratelimiting.py b/synapse/api/ratelimiting.py
index b80630c5d3..229329a5ae 100644
--- a/synapse/api/ratelimiting.py
+++ b/synapse/api/ratelimiting.py
@@ -275,6 +275,7 @@ class Ratelimiter:
         update: bool = True,
         n_actions: int = 1,
         _time_now_s: Optional[float] = None,
+        pause: Optional[float] = 0.5,
     ) -> None:
         """Checks if an action can be performed. If not, raises a LimitExceededError

@@ -298,6 +299,8 @@ class Ratelimiter:
                 at all.
             _time_now_s: The current time. Optional, defaults to the current time according
                 to self.clock. Only used by tests.
+            pause: Time in seconds to pause when an action is being limited. Defaults to 0.5
+                to stop clients from "tight-looping" on retrying their request.

         Raises:
             LimitExceededError: If an action could not be performed, along with the time in
@@ -316,9 +319,8 @@ class Ratelimiter:
         )

         if not allowed:
-            # We pause for a bit here to stop clients from "tight-looping" on
-            # retrying their request.
-            await self.clock.sleep(0.5)
+            if pause:
+                await self.clock.sleep(pause)

             raise LimitExceededError(
                 limiter_name=self._limiter_name,
diff --git a/synapse/rest/client/sync.py b/synapse/rest/client/sync.py
index b935830bae..e6742e8909 100644
--- a/synapse/rest/client/sync.py
+++ b/synapse/rest/client/sync.py
@@ -247,9 +247,9 @@ class SyncRestServlet(RestServlet):
         # send any outstanding server notices to the user.
         await self._server_notices_sender.on_user_syncing(user.to_string())

-        # ignore the presence update if the ratelimit is exceeded
+        # ignore the presence update if the ratelimit is exceeded but do not pause the request
         try:
-            await self._presence_per_user_limiter.ratelimit(requester)
+            await self._presence_per_user_limiter.ratelimit(requester, pause=0)
         except LimitExceededError:
             affect_presence = False
             logger.debug("User set_presence ratelimit exceeded; ignoring it.")
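Stripped of Synapse internals, the pattern the patch implements is a token bucket whose anti-tight-loop pause becomes optional. A self-contained sketch (class and exception names are illustrative, not the actual Synapse API):

```python
import asyncio
import time

class LimitExceeded(Exception):
    """Stand-in for Synapse's LimitExceededError."""

class Ratelimiter:
    """Token-bucket limiter; optionally pauses before rejecting, to stop
    clients from tight-looping on retries (the behaviour the patch makes
    configurable via `pause`)."""

    def __init__(self, per_second: float, burst_count: int) -> None:
        self.per_second = per_second
        self.burst_count = burst_count
        self.tokens = float(burst_count)
        self.last = time.monotonic()

    async def ratelimit(self, pause: float = 0.5) -> None:
        now = time.monotonic()
        self.tokens = min(
            self.burst_count, self.tokens + (now - self.last) * self.per_second
        )
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return
        if pause:
            # Original behaviour: always sleep 0.5 s before raising.
            await asyncio.sleep(pause)
        raise LimitExceeded

async def demo() -> bool:
    limiter = Ratelimiter(per_second=0.1, burst_count=1)
    await limiter.ratelimit()  # first update: allowed
    try:
        # Second update: rejected immediately, without the 0.5 s pause
        # blocking the surrounding /sync request (pause=0, as in the patch).
        await limiter.ratelimit(pause=0)
    except LimitExceeded:
        return False
    return True

allowed_second = asyncio.run(demo())
print(allowed_second)  # False
```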

@anoadragon453 anoadragon453 force-pushed the set-presence-ratelimit branch from 0010383 to 7564438 on January 23, 2025 17:40
@anoadragon453 (Member) commented

@rda0 apologies, I had missed that we typically set high ratelimits for our unit tests to avoid them affecting unrelated tests: https://github.com/rda0/synapse/blob/ab10c2c0b140ad30d7edde652fcdc9da2154f417/tests/utils.py#L203

I've added rc_presence to the default homeserver test config, and overrode it with a low rate-limit in your tests. All tests appear to pass now.

(I also corrected a typo in the Complement config - I had missed out the per_user field.)

@rda0 (Contributor, Author) commented Jan 24, 2025

@anoadragon453

> @rda0 apologies, I had missed that we typically set high ratelimits for our unit tests to avoid them affecting unrelated tests: https://github.com/rda0/synapse/blob/ab10c2c0b140ad30d7edde652fcdc9da2154f417/tests/utils.py#L203

No worries, and thank you for your changes.

> I've added rc_presence to the default homeserver test config, and overrode it with a low rate-limit in your tests. All tests appear to pass now.

Does that mean I should revert 7423299?

> (I also corrected a typo in the Complement config - I had missed out the per_user field.)

Some other things:

  1. Did you maybe accidentally force push over your previous changes 0010383 with https://github.com/element-hq/synapse/compare/001038392dd1215285f26b6c77574c2b5a48496b..756443817cbc184bfabf7fec9947d17c597f4deb?
  2. I still think this PR should include the change shown in #18000 (comment). Currently this PR accidentally introduces a hardcoded sleep() of 0.5 seconds on every /sync GET request whose set_presence parameter is rate-limited. This pauses the actual /sync request, and 0.5 seconds is quite a long time. That is certainly unintentional and probably unwanted by clients (I have seen Element sync at more than 10 Hz, although I find it questionable that a client syncs that much). The spec also says that /sync should not be rate-limited (but it effectively is now). However, I have not noticed any difference in my client since we patched our production server.

I am still trying to test the change in #18000 (comment), but at the moment I am unable to reproduce issue #16843 to verify that the PR still works as intended without the pause (sleep(0.5)).

@anoadragon453 (Member) commented

> Does that mean I should revert 7423299?

Yes, I think we can avoid that now.

> 1. Did you maybe accidentally force push over your previous changes 0010383 with https://github.com/element-hq/synapse/compare/001038392dd1215285f26b6c77574c2b5a48496b..756443817cbc184bfabf7fec9947d17c597f4deb?

The force push wasn't necessary, but no, that commit is still present:

https://github.com/element-hq/synapse/pull/18000/files#diff-1865c1e8307bb29e7f82bbaae74af9e1e09936211f514e8a8006cfd0c2bdea9dR1876-R1878

> 2. I still think this PR should include the change shown in #18000 (comment).

Ah sorry, I didn't realise that's what you intended with your comment above. Yes, feel free to commit it - I agree with all your points.


I'm currently investigating the remaining test failure for test_put_presence_over_ratelimit.

@rda0 (Contributor, Author) commented Jan 24, 2025

@anoadragon453 I was able to reproduce #16843 and verify that the PR still works as expected with 521b553. To reproduce, just alternate /sync requests between online and unavailable at a high rate.

[Screenshot: CPU usage during the reproduction test]
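The effect of the limiter on such a reproduction loop can be simulated deterministically with a fake clock (plain Python, not the Synapse implementation):

```python
def simulate(per_second: float, burst_count: int, hz: float, updates: int) -> int:
    """Count how many presence updates a token-bucket limiter accepts
    when a client sends them at a fixed rate of `hz` per second."""
    tokens = float(burst_count)
    last = 0.0
    accepted = 0
    for i in range(updates):
        now = i / hz  # fake clock: one update every 1/hz seconds
        tokens = min(float(burst_count), tokens + (now - last) * per_second)
        last = now
        if tokens >= 1:
            tokens -= 1
            accepted += 1
    return accepted

# At 10 updates/s against the 0.1/s default, roughly 10 of 1000
# alternating online/unavailable updates get through; the rest are ignored.
print(simulate(per_second=0.1, burst_count=1, hz=10, updates=1000))
```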

The commit 7423299 is reverted.

> I'm currently investigating the remaining test failure for test_put_presence_over_ratelimit.

All tests pass locally when I run:

matrix@phd-build-bookworm:~/synapse[set-presence-ratelimit]$ poetry run trial -j8 tests
...
Ran 3918 tests in 820.609s

PASSED (skips=17, successes=3901)

For me everything looks good now. Let me know if there is anything else.

Otherwise they're using the default, very high ratelimit config for the unit tests.
@anoadragon453 (Member) commented

Sorry, it appears that I missed a couple of spots when putting in the reduced ratelimit. I've corrected this in 8fe4cc9, and tests should all pass now.

Thanks for your persistence on this PR @rda0. Assuming CI now passes, we should be good to go!

@anoadragon453 anoadragon453 enabled auto-merge (squash) January 24, 2025 15:11
anoadragon453 added a commit to matrix-org/sytest that referenced this pull request Jan 24, 2025
This ratelimit is introduced in
element-hq/synapse#18000
and causes some existing Sytests to fail.

Increase the ratelimit in Sytest's Synapse config
so tests pass again.
@anoadragon453 (Member) commented

Sytests are failing because I forgot to bump the ratelimit there as well. This is now done in matrix-org/sytest#1394.

We'll need to merge that PR before this one so that other PRs don't fail CI.

@anoadragon453 anoadragon453 merged commit 9c5d08f into element-hq:develop Jan 24, 2025
41 checks passed
@anoadragon453 (Member) commented

matrix-org/sytest#1394 merged and the CI here passed, hence merging this PR.

Thanks again @rda0!

@rda0 (Contributor, Author) commented Jan 25, 2025

That went quickly 👏. Thank you too, @anoadragon453!

Development

Successfully merging this pull request may close these issues.

Rate-Limiting for set_presence changes