Raise aio-max-nr on GHA #596

cole-miller · 2024-02-20T21:39:34Z

No description provided.

codecov · 2024-02-20T21:44:28Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.23%. Comparing base (61fc84c) to head (e814a1c).
Report is 22 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #596      +/-   ##
==========================================
+ Coverage   81.21%   81.23%   +0.01%     
==========================================
  Files         191      192       +1     
  Lines       27033    27061      +28     
  Branches     4944     4980      +36     
==========================================
+ Hits        21956    21982      +26     
+ Misses       3483     3436      -47     
- Partials     1594     1643      +49

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

cole-miller · 2024-02-20T21:53:31Z

Ah, it's io_setup in probeAsyncIO that's failing with EAGAIN due to aio-nr reaching aio-max-nr. I'm not sure what probeAsyncIO should actually do in this case -- it probably shouldn't just report success, but on the other hand it doesn't seem good to fail dqlite_node_create because of a temporary condition. Maybe probeAsyncIO should have a retry loop with backoff internally?

cole-miller · 2024-02-22T21:15:52Z

Summary: the issue with the flaky test cluster/restart is that it somewhat consistently hits the aio-max-nr ceiling in CI during raft_uv startup (probeAsyncIO). This PR addresses that by introducing a retry loop into test_server_start (with exponential backoff). That seems to fix the problem for cluster/restart, but conn/query/one now runs into a similar issue (somewhat reproducibly), with the error originating here:

dqlite/src/raft/uv_writer.c

Lines 44 to 63 in c9eda92

    
           static int uvWriterIoSetup(unsigned n, aio_context_t *ctx, char *errmsg) 
        
           { 
        
           	int rv; 
        
           	rv = UvOsIoSetup(n, ctx); 
        
           	if (rv != 0) { 
        
           		switch (rv) { 
        
           			case UV_EAGAIN: 
        
           				ErrMsgPrintf(errmsg, 
        
           					     "AIO events user limit exceeded"); 
        
           				rv = RAFT_TOOMANY; 
        
           				break; 
        
           			default: 
        
           				UvOsErrMsg(errmsg, "io_setup", rv); 
        
           				rv = RAFT_IOERR; 
        
           				break; 
        
           		} 
        
           		return rv; 
        
           	} 
        
           	return 0; 
        
           }

This is trickier to fix because the error is propagated all the way out to the client. We don't want to put a retry loop inside dqlite or raft library code; that should live inside the test suite, wherever we call clientSendExec, etc. But this is a large refactoring that I'm not going to attempt right now.

I think aio-max-nr wasn't a problem before in raft or dqlite CI because we didn't run as many tests in parallel. Now that we have the raft tests and the dqlite tests running in the same CI job and at the same time, there's more contention for this scarce resource.

Signed-off-by: Cole Miller <[email protected]>

cole-miller · 2024-03-04T13:54:08Z

Raising aio-max-nr to 1000000 seems to reduce the rate at which these tests fail.

cole-miller force-pushed the flaky-test branch from 731e925 to ba76637 Compare February 22, 2024 16:21

cole-miller marked this pull request as ready for review February 22, 2024 16:21

cole-miller force-pushed the flaky-test branch 4 times, most recently from 7bd9e7f to 0f3ad28 Compare February 22, 2024 21:07

cole-miller requested a review from MathieuBordere February 22, 2024 21:07

cole-miller force-pushed the flaky-test branch from a927c1a to 68a1235 Compare February 23, 2024 14:49

Try raising aio-max-nr on GHA

e814a1c

Signed-off-by: Cole Miller <[email protected]>

cole-miller force-pushed the flaky-test branch from 68a1235 to e814a1c Compare March 4, 2024 13:24

cole-miller changed the title ~~Try to suss out the flaky test~~ Raise aio-max-nr on GHA Mar 4, 2024

MathieuBordere merged commit 9578db5 into canonical:master Mar 4, 2024
13 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Raise aio-max-nr on GHA #596

Raise aio-max-nr on GHA #596

cole-miller commented Feb 20, 2024

codecov bot commented Feb 20, 2024 •

edited

Loading

cole-miller commented Feb 20, 2024 •

edited

Loading

cole-miller commented Feb 22, 2024 •

edited

Loading

cole-miller commented Mar 4, 2024

Raise aio-max-nr on GHA #596

Raise aio-max-nr on GHA #596

Conversation

cole-miller commented Feb 20, 2024

codecov bot commented Feb 20, 2024 • edited Loading

Codecov Report

cole-miller commented Feb 20, 2024 • edited Loading

cole-miller commented Feb 22, 2024 • edited Loading

cole-miller commented Mar 4, 2024

codecov bot commented Feb 20, 2024 •

edited

Loading

cole-miller commented Feb 20, 2024 •

edited

Loading

cole-miller commented Feb 22, 2024 •

edited

Loading