
Excessive memory usage by k8s-dqlite #196

Open
sbidoul opened this issue Oct 30, 2024 · 13 comments

@sbidoul

sbidoul commented Oct 30, 2024

Hello,

For a few weeks now (sorry, I cannot be more precise), I have been noticing memory utilization increasing over time on the dqlite nodes.

Memory usage on the node over a week looks like this, and I can attribute the increase in RAM use to the k8s-dqlite process.

image

The cluster has been on 1.29.x since early 2024, but the leak only started to manifest itself recently, presumably following an automatic minor snap update.

Is there anything I can do to help diagnose this?

@jcjveraa

Hi there,

Same here. On my 3-node experimental cluster with 16 GB RAM per node, running barely any workloads, the memory usage of dqlite is 6.9% of total memory: 1.1 GB for a 12 MB database?

Commands used (note: by the time I ran this it was 7.0%; see the 4th column of the ps aux output).

$ ps aux | grep dqlite
root         818  5.5  7.0 2406272 1147972 ?     Ssl  Nov09 538:51 /snap/microk8s/7394/bin/k8s-dqlite --storage-dir=/var/snap/microk8s/7394/var/kubernetes/backend/ --listen=unix:///var/snap/microk8s/7394/var/kubernetes/backend/kine.sock:12379

$ microk8s dbctl backup 
$ tar -xvzf backup-2024-11-16-13-44-42.tar.gz
$ cd backup-2024-11-16-13-44-42
$ du -h
12M	.

@sbidoul
Author

sbidoul commented Nov 30, 2024

@jcjveraa to be clear, my issue is not about high memory usage, but about a continuous memory usage increase, suggesting a memory leak.

@jcjveraa

@jcjveraa to be clear, my issue is not about high memory usage, but about a continuous memory usage increase, suggesting a memory leak.

Yes, same here. In the meantime I’ve switched to k3s in high-availability (etcd) mode with the same workload, and memory consumption is completely flat. For me, “my workloads + microk8s” was certainly what triggered the memory leak, and I suspect it lies in dqlite.

@svetlak0f

Yes, I noticed the same behavior on both of my microk8s clusters.
image

When inspecting with htop, I saw huge RAM consumption by k8s-dqlite, which disappeared after restarting the DB daemon but returned after some time:

snap restart microk8s.daemon-k8s-dqlite

It definitely looks like a memory leak. I think it makes sense to put this command in cron as a temporary fix, for example:
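
A minimal /etc/cron.d sketch for such a periodic restart (the weekly schedule and file name are assumptions; adjust them to your environment):

# /etc/cron.d/restart-k8s-dqlite (sketch, not an official fix)
# Restart the dqlite daemon every Sunday at 03:00 to release the accumulated memory.
# Full path to snap is used because cron runs with a minimal PATH.
0 3 * * 0  root  /usr/bin/snap restart microk8s.daemon-k8s-dqlite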

@louiseschmidtgen
Contributor

Hi @sbidoul, @jcjveraa and @svetlak0f,

Thank you for reporting this issue to us.

While the behavior you are observing looks like a memory leak, it is actually a consequence of a configuration setting in Dqlite. In a nutshell, it has to do with the number of transactions that Dqlite caches in memory.

Would you be able to try a workaround for this issue? A tuning.yaml file can be placed in the dqlite directory /var/snap/microk8s/current/var/kubernetes/backend/. Setting the value of trailing to 1024 or 512 should mitigate this behavior:

# /var/snap/microk8s/current/var/kubernetes/backend/tuning.yaml
snapshot:
  trailing: 512

Please restart the k8s-dqlite service with sudo snap restart microk8s.daemon-k8s-dqlite for the change to take effect.
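
Put together, the steps might look like this (a sketch based on the path and command above; note that a later comment in this thread recommends also setting snapshot.threshold):

# Sketch of the workaround described above
sudo tee /var/snap/microk8s/current/var/kubernetes/backend/tuning.yaml <<'EOF'
snapshot:
  trailing: 512
EOF
sudo snap restart microk8s.daemon-k8s-dqlite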

Please let us know if this workaround helps you!

@sbidoul
Author

sbidoul commented Jan 13, 2025

Hello @louiseschmidtgen,

Thanks a lot for looking into this!

When I try that procedure, I observe a significant surge in CPU usage: markers (1), (2) and (3) correspond to adding tuning.yaml with trailing: 512. Emptying tuning.yaml (4) and restarting microk8s.daemon-k8s-dqlite brings it back to normal.

image

@jnugh

jnugh commented Jan 13, 2025

We are also seeing high memory usage, especially noticeable on very small clusters. The memory consumed by dqlite is much larger than the on-disk size of /var/snap/microk8s/current/var/kubernetes/backend: ~6.3 GB of memory usage versus far less on disk.

du -sh  /var/snap/microk8s/current/var/kubernetes/backend
276M	/var/snap/microk8s/current/var/kubernetes/backend

We are also seeing high CPU usage after setting trailing to a lower value.

@louiseschmidtgen
Contributor

Hi @sbidoul,

Thanks for testing the workaround.

Generally, the snapshot.threshold parameter needs to be adjusted to twice (or four times) the trailing value.
We will look into the CPU spike on our end; that should not occur.

Best regards,
Louise

@zhhuabj

zhhuabj commented Jan 20, 2025

Hi @louiseschmidtgen, thank you very much for your help with this issue. I also tested the workaround, but unfortunately it does not seem to have resolved the problem.

ubuntu@juju-40a105-microk8s-0:~$ cat /var/snap/microk8s/current/var/kubernetes/backend/tuning.yaml
snapshot:
  trailing: 512
ubuntu@juju-40a105-microk8s-0:~$ sudo systemctl status snap.microk8s.daemon-k8s-dqlite.service |grep running
     Active: active (running) since Thu 2025-01-16 10:14:46 UTC; 3 days ago
     
ubuntu@juju-40a105-microk8s-0:~$ cat memory_usage.log |head -n1
Thu Jan 16 10:15:21 UTC 2025: VmRSS:      206120 kB
ubuntu@juju-40a105-microk8s-0:~$ cat memory_usage.log |tail -n1
Mon Jan 20 05:15:37 UTC 2025: VmRSS:     1562188 kB

ubuntu@juju-40a105-microk8s-0:~$ cat memory_usage.sh
#!/bin/bash
# Append the VmRSS (resident memory) of the k8s-dqlite process to a log every 10 minutes.
output_file="memory_usage.log"
while true; do
    echo -n "$(date): " >> "$output_file"
    cat /proc/$(pgrep k8s-dqlite)/status | grep VmRSS >> "$output_file"
    sleep 600
done
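
For reference, a simple way to launch such a logger so it keeps running in the background (a suggestion, not necessarily how it was run here):

chmod +x memory_usage.sh
# nohup keeps the loop alive after the SSH session ends
nohup ./memory_usage.sh >/dev/null 2>&1 &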

ubuntu@juju-40a105-microk8s-0:~$ date; du -sh /var/snap/microk8s/current/var/kubernetes/backend
Thu Jan 16 10:18:31 UTC 2025
119M  /var/snap/microk8s/current/var/kubernetes/backend
ubuntu@juju-40a105-microk8s-0:~$ date; du -sh /var/snap/microk8s/current/var/kubernetes/backend
Mon Jan 20 05:38:16 UTC 2025
93M     /var/snap/microk8s/current/var/kubernetes/backend

@hartyporpoise

hartyporpoise commented Jan 24, 2025

After observing watch ps aux --sort=-%mem for a while, I noticed that k8s-dqlite's memory usage only ever increases as well. Below are some metrics from Prometheus before and after running snap restart microk8s.daemon-k8s-dqlite:

before
Image

after
Image

@zhhuabj

zhhuabj commented Jan 24, 2025

Although I can see the continuous growth of VmRSS, when I previously compiled the k8s-dqlite part of hack/dynamic-dqlite.sh with ASan support I did not see any memory leaks at runtime (please refer to the ASan result for more details: https://paste.ubuntu.com/p/bVmdHq9tVR/).

However, today when I tried to also enable ASan support for the sqlite3 part and compiled it with the command 'make dynamic',

root@focalsru:~/k8s-dqlite# grep -r 'build sqlite3' hack/dynamic-dqlite.sh -A18
# build sqlite3
ASAN_OPTION="-fsanitize=address -fno-omit-frame-pointer -fsanitize-recover=all -g -O1"
if [ ! -f "${BUILD_DIR}/sqlite/libsqlite3.la" ]; then
  (
    cd "${BUILD_DIR}"
    rm -rf sqlite
    git clone "${REPO_SQLITE}" --depth 1 --branch "${TAG_SQLITE}" > /dev/null
    cd sqlite
    export CFLAGS="${ASAN_OPTION}"
    export LDFLAGS="-fsanitize=address"
    ./configure --disable-readline \
      > /dev/null
    make libsqlite3.la -j > /dev/null
    unset CFLAGS
    unset LDFLAGS
  )
fi

# build dqlite

then I saw these memory-leak warnings (full output: https://paste.ubuntu.com/p/R55HsqhYTB/):

+ cd sqlite                                                                                                                                                                            
+ export 'CFLAGS=-fsanitize=address -fno-omit-frame-pointer -fsanitize-recover=all -g -O1'                                                                                             
+ CFLAGS='-fsanitize=address -fno-omit-frame-pointer -fsanitize-recover=all -g -O1'                                                                                                    
+ export LDFLAGS=-fsanitize=address                                                                                                                                                    
+ LDFLAGS=-fsanitize=address                                                                                                                                                           
+ ./configure --disable-readline                                                                                                                                                       
configure: WARNING: Can't find Tcl configuration definitions                                                                                                                           
configure: WARNING: *** Without Tcl the regression tests cannot be executed ***                                                                                                        
configure: WARNING: *** Consider using --with-tcl=... to define location of Tcl ***                                                                                                    
+ make libsqlite3.la -j                                                                                                                                                                
AddressSanitizer:DEADLYSIGNAL                                                                                                                                                          
AddressSanitizer:DEADLYSIGNAL                                                                                                                                                          
AddressSanitizer:DEADLYSIGNAL                                                                                                                                                          
AddressSanitizer:DEADLYSIGNAL                                                                                                                                                          
                                                                                                                                                                                       
=================================================================                                                                                                                      
==387496==ERROR: LeakSanitizer: detected memory leaks                                                                                                                                  
                                                                                                                                                                                       
Direct leak of 10240 byte(s) in 1 object(s) allocated from:                                                                                                                            
    #0 0x7342ea90da06 in __interceptor_calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cc:153                                                                               
    #1 0x64f38cc68bbd in Symbol_insert /root/k8s-dqlite/hack/.build/dynamic/sqlite/tool/lemon.c:5519                                                                                   
    #2 0x64f38cc6a0c9 in Symbol_new /root/k8s-dqlite/hack/.build/dynamic/sqlite/tool/lemon.c:5421                                                                                      
    #3 0x64f38cc700b1 in parseonetoken /root/k8s-dqlite/hack/.build/dynamic/sqlite/tool/lemon.c:2620                                                                                   
    #4 0x64f38cc700b1 in Parse /root/k8s-dqlite/hack/.build/dynamic/sqlite/tool/lemon.c:3112                                                                                           
    #5 0x64f38cc75868 in main /root/k8s-dqlite/hack/.build/dynamic/sqlite/tool/lemon.c:1703                                                                                            
    #6 0x7342ea632082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082)                                                                                                   
                                                                                                                                                                                       
Direct leak of 5120 byte(s) in 1 object(s) allocated from:                                                                                                                             
    #0 0x7342ea90da06 in __interceptor_calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cc:153                                                                               
    #1 0x64f38cc68825 in Symbol_init /root/k8s-dqlite/hack/.build/dynamic/sqlite/tool/lemon.c:5482                                                                                     
    #2 0x64f38cc75729 in main /root/k8s-dqlite/hack/.build/dynamic/sqlite/tool/lemon.c:1692                                                                                            
    #3 0x7342ea632082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082)                                                                                                   
                                                                                                                                                                                       
Direct leak of 2048 byte(s) in 1 object(s) allocated from:                                                                                                                             
    #0 0x7342ea90da06 in __interceptor_calloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cc:153                                                                               
    #1 0x64f38cc730b4 in Configtable_init /root/k8s-dqlite/hack/.build/dynamic/sqlite/tool/lemon.c:5821                                                                                
    #2 0x64f38cc731b5 in Configlist_init /root/k8s-dqlite/hack/.build/dynamic/sqlite/tool/lemon.c:1331                                                                                 
    #3 0x64f38cc7508d in FindStates /root/k8s-dqlite/hack/.build/dynamic/sqlite/tool/lemon.c:908                                                                                       
    #4 0x64f38cc762bc in main /root/k8s-dqlite/hack/.build/dynamic/sqlite/tool/lemon.c:1755                                                                                            
    #5 0x7342ea632082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082)                                                                                                   
                                                                                                                                                                                       
SUMMARY: AddressSanitizer: 17408 byte(s) leaked in 3 allocation(s).                                                                                                                    
make[1]: *** [Makefile:1140: parse.c] Error 1                                                                                                                                          
make[1]: *** Waiting for unfinished jobs....
make[1]: *** [Makefile:1267: fts5parse.c] Segmentation fault (core dumped)
make: *** [Makefile:38: bin/dynamic/lib/libdqlite.so] Error 2

But when I checked the code in /root/k8s-dqlite/hack/.build/dynamic/sqlite/tool/lemon.c, it contains the following comment ("Just leak it"), so it feels like this leak is intentional:

    /* free(x2a->tbl); // This program was originally written for 16-bit
    ** machines.  Don't worry about freeing this trivial amount of memory
    ** on modern platforms.  Just leak it. */
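
For what it's worth, since these leaks come from the build-time lemon tool rather than from libsqlite3 itself, one way to keep an ASan-instrumented sqlite3 build going might be to disable leak detection while the build-time generators run (a sketch under that assumption; it does not address the DEADLYSIGNAL/segfault above):

# Inside the cloned sqlite source tree, as in hack/dynamic-dqlite.sh above.
# ASAN_OPTIONS=detect_leaks=0 applies to every instrumented binary executed
# during "make" (including lemon), so its intentional leaks no longer make
# the build fail; the resulting libsqlite3.la stays ASan-instrumented.
export CFLAGS="-fsanitize=address -fno-omit-frame-pointer -fsanitize-recover=all -g -O1"
export LDFLAGS="-fsanitize=address"
./configure --disable-readline
ASAN_OPTIONS=detect_leaks=0 make libsqlite3.la -j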

@louiseschmidtgen
Contributor

Hi @zhhuabj,

Thank you for helping us debug the issue.

Lemon is the parser-generator tool for SQLite; the 17408 bytes are leaked during the build process. The leak does not occur while your cluster is running. You're welcome to raise an issue on the SQLite Forum to address this particular concern.

We're continuing to look into a mitigation on our end.

Thank you for your patience and contribution!
Louise

@louiseschmidtgen
Contributor

Hello @sbidoul, @jcjveraa, @jnugh, @zhhuabj and @hartyporpoise,

Thank you all for your contributions to the issue.

We will update the snapshot default configurations for the next release.

Currently, the default snapshot configuration is 1024 for the threshold and 8192 for trailing, which are quite large values for small clusters.

In the meantime, I would recommend setting smaller values in the tuning configuration, such as:

# /var/snap/microk8s/current/var/kubernetes/backend/tuning.yaml
snapshot:
  trailing: 512
  threshold: 384

Alternatively, trailing: 1024 with threshold: 512, or your own parameter combination.

Setting only the trailing parameter sets the threshold to 0, which leads to the CPU spike mentioned earlier in this issue.

Ensure that any combination used for the tuning has trailing > threshold.
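
For completeness, the alternative combination mentioned above would look like this (same file path and restart step as before):

# /var/snap/microk8s/current/var/kubernetes/backend/tuning.yaml
snapshot:
  trailing: 1024
  threshold: 512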

Attached is a sample of memory/CPU usage for different configurations on an idle microk8s cluster, recorded over 20 minutes.

I would appreciate your feedback on the tuning configuration options as a mitigation for your issues.

All the best,
Louise

Image

Image

louiseschmidtgen changed the title from "k8s-dqlite memory leak ?" to "Excessive memory usage by k8s-dqlite" on Jan 30, 2025