revert: fixed window ratelimiting #2116
Conversation
Walkthrough
The changes involve a significant restructuring of the clock and rate limiting functionalities within the application.
Changes
Sequence Diagram(s)
sequenceDiagram
participant User
participant Agent
participant RateLimiter
participant Clock
User->>Agent: Request
Agent->>RateLimiter: Check Rate Limit
RateLimiter->>Clock: Get Current Time
Clock-->>RateLimiter: Current Time
RateLimiter-->>Agent: Rate Limit Status
Agent-->>User: Response
Possibly related PRs
Suggested reviewers
Recent review details
Configuration used: CodeRabbit UI
Files selected for processing (2)
Files not reviewed due to no reviewable changes (1)
Files skipped from review as they are similar to previous changes (1)
Thank you for following the naming conventions for pull request titles! 🙏
Actionable comments posted: 7
Outside diff range and nitpick comments (3)
apps/agent/services/ratelimit/ratelimit_test.go (1)
70-70: Remove redundant error check for err
At line 70, there is a redundant require.NoError(t, err) statement. The error err has not been modified since the previous check at line 68, so this extra check is unnecessary. Apply this diff to remove the redundant line:
- require.NoError(t, err)
apps/api/src/pkg/ratelimit/client.ts (1)
Line range hint 130-202: Consider using a more efficient cache eviction strategy
In the setCacheMax method, cache eviction is performed by iterating over entries and deleting those that have expired when the cache size exceeds maxEntries. This could become inefficient as the cache grows. Consider using a Least Recently Used (LRU) cache or a similar data structure that handles eviction more efficiently; this can improve performance and reduce latency caused by cache maintenance.
Example using an LRU cache:
Implement an LRU cache mechanism or utilize an existing library to manage cache entries based on their usage and expiry.
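The file under review is TypeScript, but the eviction strategy itself is language-agnostic. Below is a minimal Go sketch of a capacity-bounded LRU cache, purely to illustrate the suggested approach; the type and field names are hypothetical and not part of this PR.

```go
package ratelimit

import "container/list"

type entry struct {
	key     string
	current int64
	reset   int64 // unix-millisecond timestamp when the window resets
}

// lruCache keeps at most maxEntries items and evicts the least recently
// used entry, instead of scanning the whole map on every insert.
type lruCache struct {
	maxEntries int
	ll         *list.List               // front = most recently used
	items      map[string]*list.Element // key -> element in ll
}

func newLRUCache(maxEntries int) *lruCache {
	return &lruCache{
		maxEntries: maxEntries,
		ll:         list.New(),
		items:      make(map[string]*list.Element),
	}
}

// Get returns the cached entry and marks it as recently used.
func (c *lruCache) Get(key string) (entry, bool) {
	el, ok := c.items[key]
	if !ok {
		return entry{}, false
	}
	c.ll.MoveToFront(el)
	return el.Value.(entry), true
}

// Set inserts or updates an entry, evicting the oldest one when full.
func (c *lruCache) Set(e entry) {
	if el, ok := c.items[e.key]; ok {
		el.Value = e
		c.ll.MoveToFront(el)
		return
	}
	c.items[e.key] = c.ll.PushFront(e)
	if c.ll.Len() > c.maxEntries {
		oldest := c.ll.Back()
		c.ll.Remove(oldest)
		delete(c.items, oldest.Value.(entry).key)
	}
}
```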
apps/agent/services/ratelimit/sliding_window.go (1)
113-114: Typo in comment: 'cachelayer' should be 'cache layer'
In the comment, "we are reverting this to fixed-window until we can get rid of the cloudflare cachelayer", "cachelayer" should be "cache layer" for clarity.
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (9)
- apps/agent/pkg/clock/real_clock.go (1 hunks)
- apps/agent/pkg/clock/test_clock.go (1 hunks)
- apps/agent/services/ratelimit/mitigate.go (2 hunks)
- apps/agent/services/ratelimit/ratelimit_mitigation_test.go (2 hunks)
- apps/agent/services/ratelimit/ratelimit_replication_test.go (1 hunks)
- apps/agent/services/ratelimit/ratelimit_test.go (1 hunks)
- apps/agent/services/ratelimit/service.go (2 hunks)
- apps/agent/services/ratelimit/sliding_window.go (4 hunks)
- apps/api/src/pkg/ratelimit/client.ts (9 hunks)
Additional comments not posted (27)
apps/agent/pkg/clock/real_clock.go (4)
5-6: LGTM!
The RealClock struct is defined correctly as an empty struct, indicating that it does not maintain any internal state. This aligns with the transition from a mockable clock to a real-time clock implementation.
8-10: LGTM!
The New() function is implemented correctly as a constructor for creating instances of RealClock. It returns a pointer to a new RealClock instance without any parameters, which aligns with the AI-generated summary.
12-12: LGTM!
The variable declaration var _ Clock = &RealClock{} is used correctly to ensure that RealClock implements the Clock interface at compile time. This aligns with the AI-generated summary and is a common pattern in Go for interface checks.
14-16: LGTM!
The Now() method on the RealClock struct is implemented correctly to return the current time using time.Now(). This aligns with the AI-generated summary and provides the expected functionality for a real-time clock.
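Taken together, the comments above imply a file along these lines. This is a reconstruction from the review, not the verified contents of real_clock.go, and the exact interface definition is an assumption.

```go
package clock

import "time"

// Clock abstracts time so that production code can use the real time
// while tests can substitute a controllable implementation.
type Clock interface {
	Now() time.Time
}

// RealClock is an empty struct: it keeps no internal state and simply
// delegates to the standard library.
type RealClock struct{}

// New returns a new RealClock.
func New() *RealClock {
	return &RealClock{}
}

// Compile-time check that RealClock satisfies the Clock interface.
var _ Clock = &RealClock{}

// Now returns the current wall-clock time.
func (c *RealClock) Now() time.Time {
	return time.Now()
}
```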
apps/agent/pkg/clock/test_clock.go (6)
5-6: LGTM!
The TestClock struct is well-defined and serves the purpose of mocking time in tests. The now field accurately represents the current time of the test clock.
9-13: LGTM!
The NewTestClock constructor function is implemented correctly. It properly handles the optional now parameter and defaults to the current time when no initial time is provided. The function returns a pointer to the newly created TestClock instance, which is the expected behavior.
16-16: LGTM!
The interface implementation at line 16 correctly verifies that TestClock satisfies the Clock interface. This ensures that TestClock can be used wherever a Clock is expected.
18-20: LGTM!
The Now method is implemented correctly. It returns the current time of the test clock by returning the value of the now field. The method logic is straightforward and has no issues.
23-26: LGTM!
The Tick method is implemented correctly. It advances the clock by the given duration, updates the now field to reflect the new time, and returns the updated time. The method logic is sound and serves the purpose of simulating the passage of time in tests.
29-31: LGTM!
The Set method is implemented correctly. It sets the clock to the given time, updates the now field to reflect the new time, and returns the updated time. The method logic is straightforward and serves the purpose of setting the test clock to a specific time for testing scenarios.
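Continuing the reconstruction above (same package; again an assumption based on the review comments, not the verified file), the TestClock described here would look roughly like this:

```go
package clock

import "time"

// TestClock is a controllable Clock for tests; now holds the clock's
// current time.
type TestClock struct {
	now time.Time
}

// NewTestClock creates a TestClock, defaulting to time.Now() when no
// initial time is supplied.
func NewTestClock(now ...time.Time) *TestClock {
	if len(now) == 0 {
		return &TestClock{now: time.Now()}
	}
	return &TestClock{now: now[0]}
}

// Compile-time check that TestClock satisfies the Clock interface.
var _ Clock = &TestClock{}

// Now returns the test clock's current time.
func (c *TestClock) Now() time.Time {
	return c.now
}

// Tick advances the clock by d and returns the new time.
func (c *TestClock) Tick(d time.Duration) time.Time {
	c.now = c.now.Add(d)
	return c.now
}

// Set moves the clock to t and returns it.
func (c *TestClock) Set(t time.Time) time.Time {
	c.now = t
	return c.now
}
```

A rate limit test can then call Tick to cross window boundaries deterministically instead of sleeping.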
apps/agent/services/ratelimit/ratelimit_mitigation_test.go (4)
27-27: LGTM!
The expanded range of cluster sizes improves the test coverage by including both small and large clusters. This change enhances the robustness of the rate limiting tests.
97-97: Good catch!
The modified loop condition fixes an off-by-one error and ensures that the rate limit is saturated with exactly limit requests. This change improves the accuracy of the test.
103-103: Excellent fix!
The modified assertion correctly checks that the rate limit response is unsuccessful after saturation. This change improves the correctness and reliability of the test by validating the expected rate limiting behavior.
111-115: Nice touch!
Correcting the typo in the comment improves the clarity and readability of the code. While it doesn't affect the functionality, it enhances the overall code quality and maintainability.
apps/agent/services/ratelimit/service.go (2)
41-42: LGTM!
The addition of the mitigateCircuitBreaker field is a good enhancement to handle mitigation requests using a dedicated circuit breaker. This can improve the resilience and fault tolerance of the service when dealing with mitigation requests. It's also good to see that the existing syncCircuitBreaker field is retained, ensuring that the circuit breaker functionality for sync requests remains intact.
68-76: LGTM!
The initialization and configuration of the mitigateCircuitBreaker field look good. The chosen parameters for the circuit breaker seem reasonable for handling mitigation requests:
- The cyclic period of 10 seconds allows for periodic health checks and state adjustments.
- The timeout of 1 minute provides a sufficient window for the service to respond to mitigation requests.
- The maximum requests limit of 100 and the trip threshold of 50 help prevent overload and trigger the open state when necessary.
It's also good to see that the syncCircuitBreaker initialization remains unchanged, indicating that its configuration is still valid.
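For readers unfamiliar with these knobs, here is a sketch of how such parameters map onto a generic circuit breaker, using the sony/gobreaker library purely as an illustration. The repository's own circuit breaker (used via .Do in mitigate.go) has a different API, so this is not the code under review.

```go
package main

import (
	"errors"
	"fmt"
	"time"

	"github.com/sony/gobreaker"
)

func main() {
	cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
		Name:        "mitigate",
		Interval:    10 * time.Second, // cyclic period: counters reset every 10s while closed
		Timeout:     time.Minute,      // how long the breaker stays open before probing again
		MaxRequests: 100,              // requests allowed through while half-open
		ReadyToTrip: func(counts gobreaker.Counts) bool {
			return counts.TotalFailures >= 50 // trip threshold
		},
	})

	_, err := cb.Execute(func() (interface{}, error) {
		// The real code would call the peer's Mitigate endpoint here.
		return nil, errors.New("peer unavailable")
	})
	fmt.Println(err)
}
```

Once the breaker opens, further calls fail fast until the timeout elapses, which is what protects the cluster from cascading failures.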
apps/agent/services/ratelimit/ratelimit_replication_test.go (1)
Line range hint 27-138: LGTM!
The changes to the test function look good:
- The renaming of the function from TestReplication to TestSync improves clarity.
- The removal of t.Skip() ensures that the test is executed as part of the test suite, helping catch any regressions in the rate limit synchronization functionality.
The test logic remains unchanged and comprehensive, testing the synchronization of rate limit data across multiple nodes in a cluster.
apps/agent/services/ratelimit/mitigate.go (1)
53-60: Good use of circuit breaker to enhance resilience
Wrapping the peer.client.Mitigate call with s.mitigateCircuitBreaker.Do introduces a circuit breaker pattern, which enhances the resilience of the system by preventing cascading failures when peers are unresponsive or experiencing errors.
apps/agent/services/ratelimit/ratelimit_test.go (1)
149-152: Verify the calculation of the upper limit in the rate limiting test
Between lines 149-152, the calculation of upper and its use might not align with the intended test logic. The comment says:
// At most 150% + 75% per additional ingress node should pass
However, the calculation is:
upper := 1.50 + 1.0*float64(len(ingressNodes)-1)
Verify whether this formula accurately represents the intended upper limit based on the comment. There may be a discrepancy that could affect the test's validity.
To ensure the calculation aligns with expectations, please double-check the formula and adjust it or the comment accordingly.
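As a concrete check of the discrepancy: with three ingress nodes, the comment implies an upper bound of 1.50 + 0.75 × 2 = 3.00, while the formula as written yields 1.50 + 1.0 × 2 = 3.50, so either the coefficient or the comment needs to change.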
apps/api/src/pkg/ratelimit/client.ts (7)
6-6: Import statement is appropriate and necessary
The addition of the retry utility is correct and aligns with the implementation of retry logic in the code.
18-18: Cache structure updated appropriately
The cache property now holds entries with reset and current values, which simplifies the caching mechanism by removing the blocked state. This change enhances clarity and maintainability.
24-24: Constructor parameters updated accordingly
The constructor now accepts the updated cache structure, ensuring consistency throughout the class.
58-62: Verify cache update logic to prevent stale data
In the setCacheMax method, the cache is updated only when current > cached.current. If current is less than or equal to cached.current, the cache remains unchanged. This could potentially lead to stale cache data if current decreases over time. Please confirm if this behavior is intentional. If the goal is to always have the most recent current value in the cache, consider updating the cache regardless of whether current is greater than cached.current:
- if (current > cached.current) {
+ if (current !== cached.current) {
    this.cache.set(id, { reset, current });
    return current;
- }
+ }
168-168: Cache updated after successful agent call
Updating the cache with the latest current and reset values from the agent ensures consistency in rate limiting decisions.
179-179: Cache updated in asynchronous operation
The cache is updated within the waitUntil asynchronous context. This ensures that even when operating asynchronously, the cache remains accurate.
202-202: Consistent cache update after local increment
After incrementing cached.current with cost, the cache is updated via setCacheMax. This maintains consistency in the cached values.
apps/agent/services/ratelimit/sliding_window.go (1)
283-283: Addition of the Sequence field to the Window struct looks good
Adding the Sequence field to the Window struct enhances sequence tracking and aligns with the changes made elsewhere in the code. This update appears appropriate.
bucket.windows[req.Window.GetSequence()] = req.Window
bucket.Unlock()
Consider using defer for unlocking to ensure the lock is always released
Replacing defer bucket.Unlock() with an explicit bucket.Unlock() may lead to the lock not being released if a panic occurs between the lock and unlock calls. Using defer ensures that the lock is always released, even in the event of an error or panic.
Apply this diff to revert to using defer:
func (s *service) Mitigate(ctx context.Context, req *ratelimitv1.MitigateRequest) (*ratelimitv1.MitigateResponse, error) {
ctx, span := tracing.Start(ctx, "ratelimit.Mitigate")
defer span.End()
s.logger.Info().Interface("req", req).Msg("mitigating")
duration := time.Duration(req.Duration) * time.Millisecond
bucket, _ := s.getBucket(bucketKey{req.Identifier, req.Limit, duration})
bucket.Lock()
+ defer bucket.Unlock()
bucket.windows[req.Window.GetSequence()] = req.Window
- bucket.Unlock()
return &ratelimitv1.MitigateResponse{}, nil
}
_, err := s.mitigateCircuitBreaker.Do(ctx, func(innerCtx context.Context) (*connect.Response[ratelimitv1.MitigateResponse], error) {
	return peer.client.Mitigate(innerCtx, connect.NewRequest(&ratelimitv1.MitigateRequest{
		Identifier: req.identifier,
		Limit:      req.limit,
		Duration:   req.duration.Milliseconds(),
		Window:     req.window,
	}))
})
Consider using a context with timeout to prevent hanging calls
Currently, context.Background() is used without any timeout or cancellation, which may lead to hanging calls if a peer does not respond. Consider using a context with a timeout to ensure that the Mitigate calls to peers do not block indefinitely.
Apply this diff to use a context with timeout:
func (s *service) broadcastMitigation(req mitigateWindowRequest) {
- ctx := context.Background()
+ ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+ defer cancel()
node, err := s.cluster.FindNode(bucketKey{req.identifier, req.limit, req.duration}.toString())
if err != nil {
s.logger.Err(err).Msg("failed to find node")
return
}
Ensure that the context with timeout is used in the circuit breaker call:
for _, peer := range peers {
_, err := s.mitigateCircuitBreaker.Do(ctx, func(innerCtx context.Context) (*connect.Response[ratelimitv1.MitigateResponse], error) {
- return peer.client.Mitigate(innerCtx, connect.NewRequest(&ratelimitv1.MitigateRequest{
+ return peer.client.Mitigate(ctx, connect.NewRequest(&ratelimitv1.MitigateRequest{
Identifier: req.identifier,
Limit: req.limit,
Duration: req.duration.Milliseconds(),
Window: req.window,
}))
})
Committable suggestion was skipped due to low confidence.
// FIXED-WINDOW
// uncomment
// currentWindowPercentage := float64(req.Time.UnixMilli()-currentWindow.Start) / float64(req.Duration.Milliseconds())
// previousWindowPercentage := 1.0 - currentWindowPercentage

// Calculate the current count including all leases
fromPreviousWindow := float64(previousWindow.Counter) * previousWindowPercentage
fromCurrentWindow := float64(currentWindow.Counter)
current := int64(math.Ceil(fromCurrentWindow + fromPreviousWindow))
// FIXED-WINDOW
// uncomment
// fromPreviousWindow := float64(previousWindow.Counter) * previousWindowPercentage
// fromCurrentWindow := float64(currentWindow.Counter)

// FIXED-WINDOW
// replace this with the following line
// current := int64(math.Ceil(fromCurrentWindow + fromPreviousWindow))
Consider removing commented-out code or using feature flags
The large block of commented-out code annotated with "FIXED-WINDOW" can clutter the codebase and reduce readability. Consider removing it or using a feature flag or configuration to toggle between sliding window and fixed-window implementations, rather than leaving code commented out.
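One possible shape for such a toggle, sketched only to illustrate the suggestion; the flag, function, and parameter names here are hypothetical and not part of this PR:

```go
package ratelimit

import "math"

// useSlidingWindow is a hypothetical feature flag; in practice it would
// come from configuration rather than a package-level variable.
var useSlidingWindow = false

// currentCount returns the request count used for the limit decision.
// The sliding-window branch mirrors the weighting shown in the diff
// context above; the fixed-window branch only counts the current window.
func currentCount(previousCounter, currentCounter, elapsedMs, durationMs int64) int64 {
	if !useSlidingWindow {
		return currentCounter
	}
	currentPct := float64(elapsedMs) / float64(durationMs)
	previousPct := 1.0 - currentPct
	fromPrevious := float64(previousCounter) * previousPct
	return int64(math.Ceil(float64(currentCounter) + fromPrevious))
}
```

Keeping both paths behind a flag documents the intended return to sliding windows once the Cloudflare cache layer is gone, without leaving dead code in the hot path.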
…nd services feat(workflows): add job_test_agent_local.yaml for testing agent locally feat(workflows): add test_agent_local job to pr.yaml for local agent testing
…nnections test(ratelimit): fix loop condition in TestAccuracy_fixed_time to iterate correctly
…ent Integration' feat(workflows): update test job to run on integration tests directory feat(workflows): add environment variables for cluster test and agent base URL
…in job_test_agent_local.yaml
fix(ratelimit_test.go): fix calculation of upper limit in test
…s to include only 1, 3, and 5 nodes
…tion fix(ratelimit_replication_test): correct index to call Ratelimit on correct node
Summary by CodeRabbit
Release Notes
New Features
- RealClock implementation for accurate timekeeping.
Improvements
- Simplified caching mechanism by removing the blocked state.
Bug Fixes
Tests