Add key-eviction-memory parameter for evicting key earlier to avoid OOM #831

Open
wants to merge 16 commits into
base: unstable
from

hwware
Member

@hwware hwware commented Jul 26, 2024

Reference: #742 and https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-best-practices-memory-management (Azure)

Typically, clients set maxmemory together with a maxmemory-policy such as allkeys-lru or another eviction policy. If the server runs a write-heavy workload, used memory can reach the maxmemory limit very quickly, at which point OOM errors are reported.

With a maxmemory-soft-scale parameter, and with the lazyfree-lazy feature enabled, we could for example set maxmemory-soft-scale to 0.8. If maxmemory is 10GB, key eviction then begins once used memory reaches 8GB, so the server can keep processing client data and OOM does not happen at that point.

The benefit is that the onset of OOM is delayed.

Here is an example for this parameter.
Assume:

maxmemory 4GB
maxmemory-soft-scale 20

Then we can check the details with the info memory command:

maxmemory:4294967296
maxmemory_human:4.00G
maxmemory_policy:allkeys-lru
maxmemory_soft_scale:20
maxmemory_soft:3435973836
maxmemory_soft_human:3.20G
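
As a rough check of the numbers above, the following sketch reproduces the maxmemory_soft value from maxmemory and maxmemory-soft-scale. It assumes the scale is the percentage of maxmemory held in reserve and that the result is rounded down. The helper name soft_limit_from_scale is hypothetical and is not taken from the patch.

```c
#include <stdint.h>

/* Hypothetical helper, not part of the patch: derive the soft limit from
 * maxmemory and a reserve percentage (the assumed meaning of
 * maxmemory-soft-scale). The result is rounded down. */
static uint64_t soft_limit_from_scale(uint64_t maxmemory, unsigned scale_percent) {
    if (scale_percent > 100) scale_percent = 100; /* clamp out-of-range input */
    uint64_t keep = 100 - scale_percent;          /* percentage that stays usable */
    /* Split the multiplication so that maxmemory * keep cannot overflow. */
    return (maxmemory / 100) * keep + (maxmemory % 100) * keep / 100;
}
```

With maxmemory 4294967296 and a scale of 20, this yields 3435973836, which matches the maxmemory_soft line in the output above.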

We can also set and read the maxmemory-soft-scale value at runtime as follows:

config set maxmemory-soft-scale value
config get maxmemory-soft-scale

@hwware hwware force-pushed the maxmemory-reserved-parameter branch from e31dd8b to b8a98df Compare July 26, 2024 15:39
@hwware hwware requested review from PingXie, enjoy-binbin and madolson and removed request for PingXie and enjoy-binbin July 26, 2024 15:52

codecov bot commented Jul 26, 2024

Codecov Report

Attention: Patch coverage is 50.00000% with 20 lines in your changes missing coverage. Please review.

Project coverage is 70.51%. Comparing base (4986310) to head (9c3068b).
Report is 31 commits behind head on unstable.

Files with missing lines Patch % Lines
src/config.c 10.00% 9 Missing ⚠️
src/evict.c 60.00% 8 Missing ⚠️
src/module.c 0.00% 3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #831      +/-   ##
============================================
- Coverage     70.68%   70.51%   -0.18%     
============================================
  Files           115      115              
  Lines         63177    63209      +32     
============================================
- Hits          44657    44569      -88     
- Misses        18520    18640     +120     
Files with missing lines Coverage Δ
src/server.c 87.65% <100.00%> (-0.02%) ⬇️
src/server.h 100.00% <100.00%> (ø)
src/module.c 9.64% <0.00%> (ø)
src/evict.c 94.56% <60.00%> (-3.19%) ⬇️
src/config.c 78.39% <10.00%> (-0.45%) ⬇️

... and 15 files with indirect coverage changes

@hwware hwware force-pushed the maxmemory-reserved-parameter branch from b8a98df to ba13a0c Compare August 29, 2024 14:50
@hwware hwware force-pushed the maxmemory-reserved-parameter branch from ba13a0c to e27e9de Compare September 9, 2024 16:16
@enjoy-binbin
Member

This seems like an interesting feature. I did not review it; I am just dropping a comment to say that I approve of the concept.

@madolson
Member

I also like the idea. I just haven't really spent enough time thinking about it. Memory management is a big area in Valkey we need to improve.

@hwware hwware force-pushed the maxmemory-reserved-parameter branch from e27e9de to 0a68013 Compare September 16, 2024 17:10
@hwware hwware force-pushed the maxmemory-reserved-parameter branch from 0a68013 to 9596c78 Compare September 24, 2024 01:14
Member

@PingXie PingXie left a comment

Took a quick look.

src/evict.c (outdated, resolved)
src/server.c (outdated, resolved)
src/evict.c (outdated, resolved)
@PingXie
Member

PingXie commented Sep 25, 2024

+1 on introducing "back pressure" earlier. I feel that this could be used with `maxmemory_eviction_tenacity` to give an even smoother eviction experience.

@hwware hwware force-pushed the maxmemory-reserved-parameter branch 2 times, most recently from 35437eb to db966fe Compare September 26, 2024 02:46
Member

@PingXie PingXie left a comment

Thanks @hwware! LGTM overall. Can you add some tests too?

src/evict.c (outdated, resolved)
src/server.h (outdated, resolved)
@hwware
Member Author

hwware commented Oct 2, 2024

> Thanks @hwware! LGTM overall. Can you add some tests too?

Sure, I will work on it. Thanks.

@hwware hwware force-pushed the maxmemory-reserved-parameter branch 4 times, most recently from f7cc8c8 to 7a5584b Compare October 7, 2024 09:31
src/config.c (resolved)
valkey.conf (outdated, resolved)
src/evict.c Outdated
@@ -398,11 +398,12 @@ int getMaxmemoryState(size_t *total, size_t *logical, size_t *tofree, float *lev
     if (total) *total = mem_reported;

     /* We may return ASAP if there is no need to compute the level. */
-    if (!server.maxmemory) {
+    if (!server.maxmemory_available) {
Member

Thinking about this more, I feel that it is preferable to model this new setting as a "soft" maxmemory, which can trigger key eviction earlier but won't cause OOM to be returned before the actual maxmemory is hit. Otherwise, we effectively create an alias of maxmemory. More specifically, I think performEviction should only return EVICT_FAIL when memory usage goes beyond the real maxmemory.

Additionally, since getMaxmemoryState is also used outside of the OOM prevention path such as in VM_GetUsedMemoryRatio, we should consider parameterizing maxmemory and having it passed in by the caller instead, so that we can maintain the same semantics in these externally facing scenarios.

Thoughts?
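
To make the parameterization idea concrete, here is a minimal sketch in which the limit is an argument instead of being read from server state. It is a simplified, hypothetical stand-in and not the real getMaxmemoryState: the name memory_state_for_limit, its reduced argument list, and the 1/0 return convention are all assumptions made for illustration.

```c
#include <stddef.h>

/* Simplified, hypothetical stand-in for getMaxmemoryState(): the limit is
 * supplied by the caller instead of being read from server state.
 * Returns 1 if used memory is within the limit (or no limit is set),
 * 0 if it exceeds the limit. */
static int memory_state_for_limit(size_t used, size_t limit, size_t *tofree, float *level) {
    if (tofree) *tofree = 0;
    if (level) *level = 0;
    if (limit == 0) return 1; /* 0 is treated as "no limit" */
    if (level) *level = (float)used / (float)limit;
    if (used <= limit) return 1;
    if (tofree) *tofree = used - limit; /* bytes to free to get back under the limit */
    return 0;
}
```

Under this shape, the eviction path could pass the soft limit while externally facing callers such as VM_GetUsedMemoryRatio keep passing the real maxmemory, so their semantics would not change.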

Member Author

@hwware hwware Oct 18, 2024

As you suggested, we can treat "maxmemory_available", when it is set, as a soft maxmemory. In that case you think we should not return OOM. Then what message should we return to the client? Any ideas?

Member

right - how about just processing the command normally?

  1. if used_memory is below soft max, no change to the existing logic
  2. if used_memory is above soft max but below hard max, trigger key eviction and continue with the normal processing
  3. if used_memory is above hard max, no change to the existing logic (i.e., trigger key eviction and fail the command if the used_memory is still above hard max after the eviction)
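
The three cases above can be sketched as a small classification step followed by a rejection check. This is only an illustration of the proposed behavior under stated assumptions: the type mem_zone and the functions classify_memory and should_reject_write are hypothetical names, and a limit of 0 is assumed to mean that the limit is not set.

```c
#include <stddef.h>

/* Hypothetical zones matching the three cases in the list above. */
typedef enum {
    MEM_BELOW_SOFT, /* case 1: existing logic, no eviction needed */
    MEM_ABOVE_SOFT, /* case 2: evict keys, then process the command normally */
    MEM_ABOVE_HARD  /* case 3: evict keys, fail the command if still above hard max */
} mem_zone;

/* Classify current usage against the soft and hard limits.
 * A limit of 0 is assumed to mean "not set". */
static mem_zone classify_memory(size_t used, size_t soft_max, size_t hard_max) {
    if (hard_max != 0 && used > hard_max) return MEM_ABOVE_HARD;
    if (soft_max != 0 && used > soft_max) return MEM_ABOVE_SOFT;
    return MEM_BELOW_SOFT;
}

/* After eviction has run, only the hard limit can cause a rejection. */
static int should_reject_write(size_t used_after_eviction, size_t hard_max) {
    return hard_max != 0 && used_after_eviction > hard_max;
}
```

The point of the split is that crossing the soft limit never rejects a command by itself; only usage that remains above the hard limit after eviction does.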

@hwware hwware force-pushed the maxmemory-reserved-parameter branch 3 times, most recently from 200b203 to 4da32a2 Compare October 18, 2024 05:54
@hwware hwware force-pushed the maxmemory-reserved-parameter branch from 4da32a2 to 510265f Compare October 28, 2024 13:42
@PingXie
Member

PingXie commented Nov 25, 2024

> I suggest even more explicit config just in bytes

Yeah that makes sense.

> the eviction can be done earlier by cron

earlier - yes;
by cron - not the case right now. It is still performed inline but I like the idea of async eviction for "soft" maxmemory too.

@hwware
Member Author

hwware commented Nov 26, 2024

> Interesting feature. I didn't look at it until now.
>
> So the purpose is that instead of SET command performing eviction, which adds latency to SET, the eviction can be done earlier by cron and this makes the SET command faster?

As the top comment describes, the feature starts the key eviction process earlier. In theory, it cannot make the SET command faster; it may even make it slower (due to the eviction while loop).

> The config name and semantic is a little bit confusing to me. Why only 10-60% allowed? And does 10% actually mean that the soft limit is 90% of maxmemory? This is confusing to me. It's better that the config says maxmemory-soft-percent 90 or something like that.

Honestly, I am not satisfied with this name either. The name originates from https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-best-practices-memory-management. Your comment gives me a hint: I would rather name it key-eviction-memory, with the config written as 'key-eviction-memory bytes'. Then the only relation between key-eviction-memory and maxmemory is that key-eviction-memory should be less than or equal to maxmemory. Does that make sense?

> Another question: Why percent of maxmemory? If the server has 100MB write commands per minute, then we need to evict 100MB per minute, but it is very different in a server with maxmemory 1GB compared to a server with maxmemory 100GB.
>
> I suggest even more explicit config just in bytes, so you can configure simply like this:
>
> maxmemory 4GB
> maxmemory-soft 3.5GB
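
For the byte-based form, the one stated constraint is that key-eviction-memory must not exceed maxmemory. A minimal sketch of such a check follows. The function name key_eviction_memory_is_valid is hypothetical, and treating 0 as "soft threshold disabled" and accepting any threshold when maxmemory is 0 (no hard limit) are assumptions that the thread does not settle.

```c
#include <stddef.h>

/* Hypothetical validation for a byte-based soft threshold.
 * Returns 1 if the pair of values is acceptable, 0 otherwise. */
static int key_eviction_memory_is_valid(size_t key_eviction_memory, size_t maxmemory) {
    if (key_eviction_memory == 0) return 1; /* assumed: 0 disables the soft threshold */
    if (maxmemory == 0) return 1;           /* assumed: no hard limit to compare against */
    return key_eviction_memory <= maxmemory;
}
```

With maxmemory at 4GB, a threshold of 3.5GB would pass this check and a threshold of 5GB would be rejected.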

@zuiderkwast
Contributor

@hwware

maxmemory 4GB
key-eviction-memory 3.5GB

Yes, it is nice. Maybe missing a word like max, limit or threshold? key-eviction-threshold 3.5GB?

@hwware
Member Author

hwware commented Nov 27, 2024

> @hwware
>
> maxmemory 4GB
> key-eviction-memory 3.5GB
>
> Yes, it is nice. Maybe missing a word like max, limit or threshold? key-eviction-threshold 3.5GB?

threshold is acceptable. key-eviction-maxmemory is not good, because I think of maxmemory as purely the maximum memory limit, and limit sounds a little weird.

@soloestoy
Member

key-eviction-memory sounds more reasonable.

I have always hoped that we could accurately measure the memory used by various different modules and then configure different thresholds and strategies, such as maxmemory-data, so that eviction would only occur when the data exceeds this threshold. Here, data specifically refers to all key-values, excluding other system memory usage such as client buffer or slowlog, etc. Although this PR doesn't completely meet my needs, I feel like the general direction is aligned.

@hwware hwware changed the title Add maxmemory-reserved-scale parameter for evicting key earlier to avoid OOM Add key-eviction-memory parameter for evicting key earlier to avoid OOM Dec 4, 2024
@hwware hwware force-pushed the maxmemory-reserved-parameter branch 3 times, most recently from be2efd1 to aa5d67d Compare December 11, 2024 16:46
@zuiderkwast
Contributor

> I have always hoped that we could accurately measure the memory used by various different modules and then configure different thresholds and strategies, such as maxmemory-data, so that eviction would only occur when the data exceeds this threshold. Here, data specifically refers to all key-values, excluding other system memory usage such as client buffer or slowlog, etc. Although this PR doesn't completely meet my needs, I feel like the general direction is aligned.

@soloestoy There is already an INFO field used_memory_dataset. Do you think we should use this number to do eviction with new config like maxmemory-data? Documentation of the INFO field:

used_memory_dataset: The size in bytes of the dataset (used_memory_overhead subtracted from used_memory)

@PingXie
Member

PingXie commented Jan 8, 2025

> @soloestoy There is already an INFO field used_memory_dataset. Do you think we should use this number to do eviction with new config like maxmemory-data? Documentation of the INFO field:

This seems to be going in a slightly different direction. My mental model so far has been a "tighter" maxmemory, which is still a superset of used_memory_dataset. With @soloestoy's proposal of per "component" thresholds, more specifically a "data only" threshold in this case, I think the implementation would be quite different. I can see a "tighter" maxmemory being a natural extension of maxmemory, while used_memory_dataset provides a new capability. I don't have a strong opinion between the two but I feel that we only need one. We should discuss some more and settle on one.

@soloestoy
Member

> @soloestoy There is already an INFO field used_memory_dataset. Do you think we should use this number to do eviction with new config like maxmemory-data?

Yes, I tend to use used_memory_dataset for eviction, as it better aligns with the semantics of maxmemory-data or key-eviction-memory and the actual action: delete key.

However, the current used_memory_dataset is not accurate. It is computed as used_memory - overhead_memory, a reverse calculation rather than a direct accounting of the data. We can further optimize this in other PRs.

hwware and others added 16 commits January 22, 2025 20:13
Signed-off-by: hwware <[email protected]>
Signed-off-by: hwware <[email protected]>
Signed-off-by: hwware <[email protected]>
Signed-off-by: Shivshankar-Reddy <[email protected]>
Signed-off-by: hwware <[email protected]>
Signed-off-by: hwware <[email protected]>
@hwware hwware force-pushed the maxmemory-reserved-parameter branch 2 times, most recently from a3b5eb4 to 1e06571 Compare January 22, 2025 20:16
Labels
None yet
Projects
Status: Idea
7 participants