
revert: fixed window ratelimiting #2116

Merged
merged 12 commits on Sep 20, 2024

Conversation

chronark
Collaborator

@chronark chronark commented Sep 20, 2024

  • chore: switch file names
  • revert: fixed window ratelimit

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced a RealClock implementation for accurate timekeeping.
    • Enhanced rate limiting with a new circuit breaker for mitigation requests.
    • Added a new GitHub Actions workflow for local integration testing of the agent.
  • Improvements

    • Simplified caching mechanism in the rate limiter by removing the blocked state.
    • Improved error handling with a retry mechanism for agent calls.
  • Bug Fixes

    • Adjusted locking behavior in the mitigation process for better performance.
  • Tests

    • Expanded test coverage for rate limiting with additional cluster sizes and refined logic for accuracy.
    • Updated test naming and ensured all tests are executed as part of the suite.

@chronark chronark requested a review from perkinsjr as a code owner September 20, 2024 13:00

changeset-bot bot commented Sep 20, 2024

⚠️ No Changeset found

Latest commit: b5e5529

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


vercel bot commented Sep 20, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
dashboard ✅ Ready (Inspect) Visit Preview 💬 Add feedback Sep 20, 2024 4:49pm
planetfall ✅ Ready (Inspect) Visit Preview 💬 Add feedback Sep 20, 2024 4:49pm
play ✅ Ready (Inspect) Visit Preview 💬 Add feedback Sep 20, 2024 4:49pm
workflows ✅ Ready (Inspect) Visit Preview 💬 Add feedback Sep 20, 2024 4:49pm
www ✅ Ready (Inspect) Visit Preview 💬 Add feedback Sep 20, 2024 4:49pm

Contributor

coderabbitai bot commented Sep 20, 2024

Walkthrough

The changes involve a significant restructuring of the clock and rate limiting functionalities within the application. The TestClock has been replaced with a RealClock, shifting from a mockable clock to a real-time clock implementation. Additionally, the rate limiting logic has been enhanced with a new circuit breaker for mitigation requests, and the rate limiting tests have been expanded and refined. A new GitHub Actions workflow has also been introduced to facilitate local testing.

Changes

File Path Change Summary
apps/agent/pkg/clock/real_clock.go Transitioned from TestClock to RealClock, removing mockable clock functionality. Introduced New function for RealClock and updated Now method to return current time.
apps/agent/pkg/clock/test_clock.go Introduced TestClock struct with methods for managing time (Now, Tick, Set). Added NewTestClock constructor for creating instances.
apps/agent/services/ratelimit/mitigate.go Modified locking behavior in Mitigate method and enhanced error handling in broadcastMitigation method using a circuit breaker.
apps/agent/services/ratelimit/ratelimit_mitigation_test.go Expanded test coverage for rate limiting by including more cluster sizes and refining assertions.
apps/agent/services/ratelimit/ratelimit_replication_test.go Renamed TestReplication to TestSync and removed skip statement to ensure the test runs.
apps/agent/services/ratelimit/ratelimit_test.go Restructured TestAccuracy_fixed_time to iterate over multiple cluster sizes and improved organization of test logic. Removed loadTest function.
.github/workflows/job_test_agent_local.yaml Introduced a new GitHub Actions workflow for local testing of the agent application.
.github/workflows/pr.yaml Added a new job test_agent_local to the existing GitHub Actions workflow, expanding testing capabilities.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Agent
    participant RateLimiter
    participant Clock

    User->>Agent: Request
    Agent->>RateLimiter: Check Rate Limit
    RateLimiter->>Clock: Get Current Time
    Clock-->>RateLimiter: Current Time
    RateLimiter-->>Agent: Rate Limit Status
    Agent-->>User: Response

Possibly related PRs

  • fix: cf cache ratelimits #2112: The changes in this PR focus on fixing rate limits, which may relate to the overall functionality of the clock implementation in the main PR, particularly if rate limiting is affected by time management. However, there is no direct code-level connection to the clock package changes.

Suggested reviewers

  • perkinsjr

Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between cf97625 and b5e5529.

Files selected for processing (2)
  • apps/agent/pkg/circuitbreaker/lib.go (0 hunks)
  • apps/agent/services/ratelimit/ratelimit_replication_test.go (2 hunks)
Files not reviewed due to no reviewable changes (1)
  • apps/agent/pkg/circuitbreaker/lib.go
Files skipped from review as they are similar to previous changes (1)
  • apps/agent/services/ratelimit/ratelimit_replication_test.go

Contributor

github-actions bot commented Sep 20, 2024

Thank you for following the naming conventions for pull request titles! 🙏

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

Outside diff range and nitpick comments (3)
apps/agent/services/ratelimit/ratelimit_test.go (1)

70-70: Remove redundant error check for err

At line 70, there is a redundant require.NoError(t, err) statement. The error err has not been modified since the previous check at line 68. This extra check is unnecessary.

Apply this diff to remove the redundant line:

-    require.NoError(t, err)
apps/api/src/pkg/ratelimit/client.ts (1)

Line range hint 130-202: Consider using a more efficient cache eviction strategy

In the setCacheMax method, cache eviction is performed by iterating over entries and deleting those that have expired when the cache size exceeds maxEntries. This could become inefficient as the cache grows.

Consider using a Least Recently Used (LRU) cache or a similar data structure that handles eviction more efficiently. This can improve performance and reduce latency caused by cache maintenance.

Example using an LRU cache:

Implement an LRU cache mechanism or utilize an existing library to manage cache entries based on their usage and expiry.
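The actual cache lives in TypeScript (`client.ts`); as a language-neutral illustration of the suggested eviction strategy, here is a minimal LRU sketch in Go using `container/list` — O(1) get/set with least-recently-used eviction instead of scanning all entries for expired ones. The `entry` shape (key, `current`, `reset`) mirrors the cached values described above; everything else is illustrative:

```go
package main

import (
	"container/list"
	"fmt"
)

type entry struct {
	key     string
	current int64
	reset   int64
}

type LRU struct {
	max   int
	order *list.List // front = most recently used
	items map[string]*list.Element
}

func NewLRU(max int) *LRU {
	return &LRU{max: max, order: list.New(), items: map[string]*list.Element{}}
}

// Get returns the entry and marks it as recently used.
func (l *LRU) Get(key string) (entry, bool) {
	el, ok := l.items[key]
	if !ok {
		return entry{}, false
	}
	l.order.MoveToFront(el)
	return el.Value.(entry), true
}

// Set inserts or updates an entry, evicting the least recently
// used one when capacity is exceeded.
func (l *LRU) Set(e entry) {
	if el, ok := l.items[e.key]; ok {
		el.Value = e
		l.order.MoveToFront(el)
		return
	}
	l.items[e.key] = l.order.PushFront(e)
	if l.order.Len() > l.max {
		oldest := l.order.Back()
		l.order.Remove(oldest)
		delete(l.items, oldest.Value.(entry).key)
	}
}

func main() {
	c := NewLRU(2)
	c.Set(entry{key: "a", current: 1})
	c.Set(entry{key: "b", current: 2})
	c.Get("a")                         // touch "a" so "b" becomes oldest
	c.Set(entry{key: "c", current: 3}) // evicts "b"
	_, ok := c.Get("b")
	fmt.Println(ok) // false
}
```

Eviction cost is paid incrementally on each insert rather than in a full scan once `maxEntries` is exceeded, which is the latency win the comment is pointing at.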

apps/agent/services/ratelimit/sliding_window.go (1)

113-114: Typo in comment: 'cachelayer' should be 'cache layer'

In the comment, "we are reverting this to fixed-window until we can get rid of the cloudflare cachelayer", "cachelayer" should be "cache layer" for clarity.

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between cc0eedd and ad026f4.

Files selected for processing (9)
  • apps/agent/pkg/clock/real_clock.go (1 hunks)
  • apps/agent/pkg/clock/test_clock.go (1 hunks)
  • apps/agent/services/ratelimit/mitigate.go (2 hunks)
  • apps/agent/services/ratelimit/ratelimit_mitigation_test.go (2 hunks)
  • apps/agent/services/ratelimit/ratelimit_replication_test.go (1 hunks)
  • apps/agent/services/ratelimit/ratelimit_test.go (1 hunks)
  • apps/agent/services/ratelimit/service.go (2 hunks)
  • apps/agent/services/ratelimit/sliding_window.go (4 hunks)
  • apps/api/src/pkg/ratelimit/client.ts (9 hunks)
Additional comments not posted (27)
apps/agent/pkg/clock/real_clock.go (4)

5-6: LGTM!

The RealClock struct is defined correctly as an empty struct, indicating that it does not maintain any internal state. This aligns with the transition from a mockable clock to a real-time clock implementation.


8-10: LGTM!

The New() function is implemented correctly as a constructor for creating instances of RealClock. It returns a pointer to a new RealClock instance without any parameters, which aligns with the AI-generated summary.


12-12: LGTM!

The variable declaration var _ Clock = &RealClock{} is used correctly to ensure that RealClock implements the Clock interface at compile-time. This aligns with the AI-generated summary and is a common pattern in Go for interface checks.


14-16: LGTM!

The Now() method on the RealClock struct is implemented correctly to return the current time using time.Now(). This aligns with the AI-generated summary and provides the expected functionality for a real-time clock.

apps/agent/pkg/clock/test_clock.go (6)

5-6: LGTM!

The TestClock struct is well-defined and serves the purpose of mocking time in tests. The now field accurately represents the current time of the test clock.


9-13: LGTM!

The NewTestClock constructor function is implemented correctly. It properly handles the optional now parameter and defaults to the current time when no initial time is provided. The function returns a pointer to the newly created TestClock instance, which is the expected behavior.


16-16: LGTM!

The interface implementation at line 16 correctly verifies that TestClock satisfies the Clock interface. This ensures that TestClock can be used wherever a Clock is expected.


18-20: LGTM!

The Now method is implemented correctly. It returns the current time of the test clock by returning the value of the now field. The method logic is straightforward and has no issues.


23-26: LGTM!

The Tick method is implemented correctly. It advances the clock by the given duration, updates the now field to reflect the new time, and returns the updated time. The method logic is sound and serves the purpose of simulating the passage of time in tests.


29-31: LGTM!

The Set method is implemented correctly. It sets the clock to the given time, updates the now field to reflect the new time, and returns the updated time. The method logic is straightforward and serves the purpose of setting the test clock to a specific time for testing scenarios.

apps/agent/services/ratelimit/ratelimit_mitigation_test.go (4)

27-27: LGTM!

The expanded range of cluster sizes improves the test coverage by including both small and large clusters. This change enhances the robustness of the rate limiting tests.


97-97: Good catch!

The modified loop condition fixes an off-by-one error and ensures that the rate limit is saturated with exactly limit requests. This change improves the accuracy of the test.


103-103: Excellent fix!

The modified assertion correctly checks that the rate limit response is unsuccessful after saturation. This change improves the correctness and reliability of the test by validating the expected rate limiting behavior.


111-115: Nice touch!

Correcting the typo in the comment improves the clarity and readability of the code. While it doesn't affect the functionality, it enhances the overall code quality and maintainability.

apps/agent/services/ratelimit/service.go (2)

41-42: LGTM!

The addition of the mitigateCircuitBreaker field is a good enhancement to handle mitigation requests using a dedicated circuit breaker. This can improve the resilience and fault tolerance of the service when dealing with mitigation requests.

It's also good to see that the existing syncCircuitBreaker field is retained, ensuring that the circuit breaker functionality for sync requests remains intact.


68-76: LGTM!

The initialization and configuration of the mitigateCircuitBreaker field look good. The chosen parameters for the circuit breaker seem reasonable for handling mitigation requests:

  • The cyclic period of 10 seconds allows for periodic health checks and state adjustments.
  • The timeout of 1 minute provides a sufficient window for the service to respond to mitigation requests.
  • The maximum requests limit of 100 and the trip threshold of 50 help prevent overload and trigger the open state when necessary.

It's also good to see that the syncCircuitBreaker initialization remains unchanged, indicating that its configuration is still valid.

apps/agent/services/ratelimit/ratelimit_replication_test.go (1)

Line range hint 27-138: LGTM!

The changes to the test function look good:

  • The renaming of the function from TestReplication to TestSync improves clarity.
  • The removal of t.Skip() ensures that the test is executed as part of the test suite, helping catch any regressions in the rate limit synchronization functionality.

The test logic remains unchanged and comprehensive, testing the synchronization of rate limit data across multiple nodes in a cluster.

apps/agent/services/ratelimit/mitigate.go (1)

53-60: Good use of circuit breaker to enhance resilience

Wrapping the peer.client.Mitigate call with s.mitigateCircuitBreaker.Do introduces a circuit breaker pattern, which enhances the resilience of the system by preventing cascading failures when peers are unresponsive or experiencing errors.

apps/agent/services/ratelimit/ratelimit_test.go (1)

149-152: Verify the calculation of upper limit in rate limiting test

Between lines 149-152, the calculation of upper and its use might not align with the intended test logic. The comment mentions:

// At most 150% + 75% per additional ingress node should pass

However, the calculation is:

upper := 1.50 + 1.0*float64(len(ingressNodes)-1)

Verify whether this formula accurately represents the intended upper limit based on the comment. There may be a discrepancy that could affect the test's validity.

To ensure the calculation aligns with expectations, please double-check the formula and adjust it or the comment accordingly.
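To make the discrepancy concrete: for a 3-node cluster the two readings diverge. Which side is the bug (the comment or the code) is for the authors to decide; this snippet only computes both bounds:

```go
package main

import "fmt"

func main() {
	nodes := 3
	// Bound implied by the comment: "150% + 75% per additional ingress node".
	fromComment := 1.50 + 0.75*float64(nodes-1)
	// Bound the test actually computes: 100% per additional node.
	fromCode := 1.50 + 1.0*float64(nodes-1)
	fmt.Println(fromComment, fromCode) // 3 3.5
}
```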

apps/api/src/pkg/ratelimit/client.ts (7)

6-6: Import statement is appropriate and necessary

The addition of the retry utility is correct and aligns with the implementation of retry logic in the code.


18-18: Cache structure updated appropriately

The cache property now holds entries with reset and current values, which simplifies the caching mechanism by removing the blocked state. This change enhances clarity and maintainability.


24-24: Constructor parameters updated accordingly

The constructor now accepts the updated cache structure, ensuring consistency throughout the class.


58-62: Verify cache update logic to prevent stale data

In the setCacheMax method, the cache is updated only when current > cached.current. If current is less than or equal to cached.current, the cache remains unchanged. This could potentially lead to stale cache data if current decreases over time.

Please confirm if this behavior is intentional. If the goal is to always have the most recent current value in the cache, consider updating the cache regardless of whether current is greater than cached.current:

- if (current > cached.current) {
+ if (current !== cached.current) {
    this.cache.set(id, { reset, current });
    return current;
- }
+ }

168-168: Cache updated after successful agent call

Updating the cache with the latest current and reset values from the agent ensures consistency in rate limiting decisions.


179-179: Cache updated in asynchronous operation

The cache is updated within the waitUntil asynchronous context. This ensures that even when operating asynchronously, the cache remains accurate.


202-202: Consistent cache update after local increment

After incrementing cached.current with cost, the cache is updated via setCacheMax. This maintains consistency in the cached values.

apps/agent/services/ratelimit/sliding_window.go (1)

283-283: Addition of 'Sequence' field to 'Window' struct looks good

Adding the Sequence field to the Window struct enhances sequence tracking and aligns with the changes made elsewhere in the code. This update appears appropriate.

bucket.windows[req.Window.GetSequence()] = req.Window
bucket.Unlock()
Contributor


Consider using defer for unlocking to ensure lock is always released

Replacing defer bucket.Unlock() with an explicit bucket.Unlock() may lead to the lock not being released if a panic occurs between the lock and unlock calls. Using defer ensures that the lock is always released, even in the event of an error or panic.

Apply this diff to revert to using defer:

 func (s *service) Mitigate(ctx context.Context, req *ratelimitv1.MitigateRequest) (*ratelimitv1.MitigateResponse, error) {
 	ctx, span := tracing.Start(ctx, "ratelimit.Mitigate")
 	defer span.End()

 	s.logger.Info().Interface("req", req).Msg("mitigating")

 	duration := time.Duration(req.Duration) * time.Millisecond
 	bucket, _ := s.getBucket(bucketKey{req.Identifier, req.Limit, duration})
 	bucket.Lock()
+	defer bucket.Unlock()
 	bucket.windows[req.Window.GetSequence()] = req.Window
-	bucket.Unlock()

 	return &ratelimitv1.MitigateResponse{}, nil
 }

Comment on lines 53 to 60
_, err := s.mitigateCircuitBreaker.Do(ctx, func(innerCtx context.Context) (*connect.Response[ratelimitv1.MitigateResponse], error) {
return peer.client.Mitigate(innerCtx, connect.NewRequest(&ratelimitv1.MitigateRequest{
Identifier: req.identifier,
Limit: req.limit,
Duration: req.duration.Milliseconds(),
Window: req.window,
}))
})
Contributor


Consider using a context with timeout to prevent hanging calls

Currently, context.Background() is used without any timeout or cancellation, which may lead to hanging calls if a peer does not respond. Consider using a context with a timeout to ensure that the Mitigate calls to peers do not block indefinitely.

Apply this diff to use a context with timeout:

 func (s *service) broadcastMitigation(req mitigateWindowRequest) {
-	ctx := context.Background()
+	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+	defer cancel()
 	node, err := s.cluster.FindNode(bucketKey{req.identifier, req.limit, req.duration}.toString())
 	if err != nil {
 		s.logger.Err(err).Msg("failed to find node")
 		return
 	}

Ensure that the context with timeout is used in the circuit breaker call:

 for _, peer := range peers {
 		_, err := s.mitigateCircuitBreaker.Do(ctx, func(innerCtx context.Context) (*connect.Response[ratelimitv1.MitigateResponse], error) {
-			return peer.client.Mitigate(innerCtx, connect.NewRequest(&ratelimitv1.MitigateRequest{
+			return peer.client.Mitigate(ctx, connect.NewRequest(&ratelimitv1.MitigateRequest{
 				Identifier: req.identifier,
 				Limit:      req.limit,
 				Duration:   req.duration.Milliseconds(),
 				Window:     req.window,
 			}))
 		})

Committable suggestion was skipped due to low confidence.

apps/agent/services/ratelimit/ratelimit_test.go (outdated, resolved)
apps/api/src/pkg/ratelimit/client.ts (resolved)
apps/api/src/pkg/ratelimit/client.ts (resolved)
Comment on lines +138 to +151
// FIXED-WINDOW
// uncomment
// currentWindowPercentage := float64(req.Time.UnixMilli()-currentWindow.Start) / float64(req.Duration.Milliseconds())
// previousWindowPercentage := 1.0 - currentWindowPercentage

// Calculate the current count including all leases
fromPreviousWindow := float64(previousWindow.Counter) * previousWindowPercentage
fromCurrentWindow := float64(currentWindow.Counter)
current := int64(math.Ceil(fromCurrentWindow + fromPreviousWindow))
// FIXED-WINDOW
// uncomment
// fromPreviousWindow := float64(previousWindow.Counter) * previousWindowPercentage
// fromCurrentWindow := float64(currentWindow.Counter)

// FIXED-WINDOW
// replace this with the following line
// current := int64(math.Ceil(fromCurrentWindow + fromPreviousWindow))
Contributor


Consider removing commented-out code or using feature flags

The large block of commented-out code annotated with "FIXED-WINDOW" can clutter the codebase and reduce readability. Consider removing it or using a feature flag or configuration to toggle between sliding window and fixed-window implementations, rather than leaving code commented out.
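A sketch of that suggestion: gate the window arithmetic behind a flag instead of leaving the sliding-window branch commented out. The arithmetic mirrors the snippet in the diff above; the flag name and `window` struct shape are hypothetical:

```go
package main

import (
	"fmt"
	"math"
)

type window struct {
	Start   int64 // window start, unix milliseconds
	Counter int64
}

// currentCount returns the effective request count for a limit check.
func currentCount(useSlidingWindow bool, nowMilli, durationMilli int64, previous, current window) int64 {
	if !useSlidingWindow {
		// Fixed window: only the current window counts.
		return current.Counter
	}
	// Sliding window: weight the previous window by how much of it
	// still overlaps the lookback period.
	currentPct := float64(nowMilli-current.Start) / float64(durationMilli)
	previousPct := 1.0 - currentPct
	fromPrevious := float64(previous.Counter) * previousPct
	return int64(math.Ceil(float64(current.Counter) + fromPrevious))
}

func main() {
	prev := window{Start: 0, Counter: 100}
	cur := window{Start: 1000, Counter: 10}
	// 25% into the current 1s window -> 75% of the previous window counts.
	fmt.Println(currentCount(true, 1250, 1000, prev, cur))  // 85
	fmt.Println(currentCount(false, 1250, 1000, prev, cur)) // 10
}
```

With a flag (or config value) both paths stay compiled and tested, and reverting to sliding-window later is a one-line change rather than an uncomment-and-pray edit.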

  • …nd services
    -- feat(workflows): add job_test_agent_local.yaml for testing agent locally
    -- feat(workflows): add test_agent_local job to pr.yaml for local agent testing
  • …nnections
    -- test(ratelimit): fix loop condition in TestAccuracy_fixed_time to iterate correctly
  • …ent Integration'
    -- feat(workflows): update test job to run on integration tests directory
    -- feat(workflows): add environment variables for cluster test and agent base URL
    -- fix(ratelimit_test.go): fix calculation of upper limit in test
  • …tion
    -- fix(ratelimit_replication_test): correct index to call Ratelimit on correct node