Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raise aio-max-nr on GHA #596

Merged
merged 1 commit into from
Mar 4, 2024
Merged

Conversation

cole-miller
Copy link
Contributor

No description provided.

Copy link

codecov bot commented Feb 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.23%. Comparing base (61fc84c) to head (e814a1c).
Report is 22 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #596      +/-   ##
==========================================
+ Coverage   81.21%   81.23%   +0.01%     
==========================================
  Files         191      192       +1     
  Lines       27033    27061      +28     
  Branches     4944     4980      +36     
==========================================
+ Hits        21956    21982      +26     
+ Misses       3483     3436      -47     
- Partials     1594     1643      +49     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@cole-miller
Copy link
Contributor Author

cole-miller commented Feb 20, 2024

Ah, it's io_setup in probeAsyncIO that's failing with EAGAIN due to aio-nr reaching aio-max-nr. I'm not sure what probeAsyncIO should actually do in this case -- it probably shouldn't just report success, but on the other hand it doesn't seem good to fail dqlite_node_create because of a temporary condition. Maybe probeAsyncIO should have a retry loop with backoff internally?

@cole-miller cole-miller marked this pull request as ready for review February 22, 2024 16:21
@cole-miller cole-miller force-pushed the flaky-test branch 4 times, most recently from 7bd9e7f to 0f3ad28 Compare February 22, 2024 21:07
@cole-miller
Copy link
Contributor Author

cole-miller commented Feb 22, 2024

Summary: the issue with the flaky test cluster/restart is that it somewhat consistently hits the aio-max-nr ceiling in CI during raft_uv startup (probeAsyncIO). This PR addresses that by introducing a retry loop into test_server_start (with exponential backoff). That seems to fix the problem for cluster/restart, but conn/query/one now runs into a similar issue (somewhat reproducibly), with the error originating here:

static int uvWriterIoSetup(unsigned n, aio_context_t *ctx, char *errmsg)
{
int rv;
rv = UvOsIoSetup(n, ctx);
if (rv != 0) {
switch (rv) {
case UV_EAGAIN:
ErrMsgPrintf(errmsg,
"AIO events user limit exceeded");
rv = RAFT_TOOMANY;
break;
default:
UvOsErrMsg(errmsg, "io_setup", rv);
rv = RAFT_IOERR;
break;
}
return rv;
}
return 0;
}

This is trickier to fix because the error is propagated all the way out to the client. We don't want to put a retry loop inside dqlite or raft library code; that should live inside the test suite, wherever we call clientSendExec, etc. But this is a large refactoring that I'm not going to attempt right now.

I think aio-max-nr wasn't a problem before in raft or dqlite CI because we didn't run as many tests in parallel. Now that we have the raft tests and the dqlite tests running in the same CI job and at the same time, there's more contention for this scarce resource.

@cole-miller cole-miller changed the title Try to suss out the flaky test Raise aio-max-nr on GHA Mar 4, 2024
@cole-miller
Copy link
Contributor Author

Raising aio-max-nr to 1000000 seems to reduce the rate at which these tests fail.

@MathieuBordere MathieuBordere merged commit 9578db5 into canonical:master Mar 4, 2024
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants