Make a new GitHub issue when a nightly run fails #6689

Open
danielhollas opened this issue Jan 9, 2025 · 5 comments

@danielhollas
Collaborator

danielhollas commented Jan 9, 2025

We're running the nightly tests once per day:

- cron: 0 0 * * * # Run every day at midnight
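
For context, the trigger in the nightly workflow file presumably looks roughly like the following sketch (other workflow keys omitted; the exact file contents may differ):

  on:
    schedule:
      - cron: 0 0 * * *  # run every day at midnight (UTC)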

When the workflow fails, the failure is posted in the aiida-core-dev Slack channel.

There are several issues with this:

  1. The Slack channel is not public and is easy to miss (as @GeigerJ2 and I discussed during the coding week, people were generally not aware of it).
  2. It's hard to track, discuss and resolve the failures in a consistent manner.

Instead of posting to a Slack channel, I would propose that a failing workflow automatically create a GitHub issue. That might be noisy at first, but it would force us to deal with the issues transparently.

The implementation is actually quite simple; I took this idea from the ruff repository:
https://github.com/astral-sh/ruff/blob/d0b2bbd55ee6435bc3dad8db2898aec216d85121/.github/workflows/daily_fuzz.yaml#L60

  create-issue-on-failure:
    name: Create an issue if the daily fuzz surfaced any bugs
    runs-on: ubuntu-latest
    needs: fuzz
    if: ${{ github.repository == 'astral-sh/ruff' && always() && github.event_name == 'schedule' && needs.fuzz.result == 'failure' }}
    permissions:
      issues: write
    steps:
      - uses: actions/github-script@v7
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          script: |
            await github.rest.issues.create({
              owner: "astral-sh",
              repo: "ruff",
              title: `Daily parser fuzz failed on ${new Date().toDateString()}`,
              body: "Run listed here: https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}",
              labels: ["bug", "parser", "fuzzer"],
            })
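
For aiida-core, an adapted job appended to the nightly workflow could look roughly like the sketch below. Note that the job id nightly-tests and the label nightly-failure are placeholders I made up for illustration, not the actual names in our workflow:

  create-issue-on-failure:
    name: Create an issue if the nightly tests failed
    runs-on: ubuntu-latest
    needs: nightly-tests  # placeholder: id of the nightly test job
    if: ${{ github.repository == 'aiidateam/aiida-core' && always() && github.event_name == 'schedule' && needs.nightly-tests.result == 'failure' }}
    permissions:
      issues: write
    steps:
      - uses: actions/github-script@v7
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          script: |
            await github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: `Nightly tests failed on ${new Date().toDateString()}`,
              body: "Failed run: https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}",
              labels: ["nightly-failure"],  // placeholder label
            })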

CC @unkcpz @agoscinski

@unkcpz
Member

unkcpz commented Jan 12, 2025

@danielhollas this would be helpful. In #6701 I am testing which slow tests can go to the nightly run without lowering the test coverage. I hope these two changes can be combined to make our lives easier.

@GeigerJ2
Contributor

Thanks for looking into this, @danielhollas! I was quite surprised when I heard that failures were previously reported via Slack channel notifications ^^ So I'm in favor of this! It is more transparent, and better to deal with failures directly on GitHub.
Just wondering what happens if the nightly build fails more than once because of the same problem. Judging from the YAML, I assume we would end up with one issue per day whenever the nightly run fails.

@danielhollas
Collaborator Author

Just wondering what happens if the nightly build fails more than once because of the same problem. Judging from the YAML, I assume we would end up with one issue per day when nightly fails.

Good question. Indeed, if there is a test that fails reproducibly, it will generate a new issue every day. A good reminder to fix it! :-D I think a more common situation will be a flaky test that fails from time to time. In that case the new tickets should be marked as duplicates and closed. WDYT? (btw: this is how the Slack messages behave already).
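
(As a sketch of one way to reduce the noise, using the same placeholder names as above: the github-script step could first search for an open issue carrying the failure label and comment on it instead of opening a duplicate. Just an illustration, not part of the current proposal.)

      - uses: actions/github-script@v7
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          script: |
            // Look for an existing open issue that carries the (placeholder) failure label.
            const { data: existing } = await github.rest.issues.listForRepo({
              owner: context.repo.owner,
              repo: context.repo.repo,
              labels: "nightly-failure",
              state: "open",
            });
            const body = "Failed run: https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}";
            if (existing.length > 0) {
              // Comment on the newest matching issue instead of creating a duplicate.
              await github.rest.issues.createComment({
                owner: context.repo.owner,
                repo: context.repo.repo,
                issue_number: existing[0].number,
                body: body,
              });
            } else {
              await github.rest.issues.create({
                owner: context.repo.owner,
                repo: context.repo.repo,
                title: `Nightly tests failed on ${new Date().toDateString()}`,
                body: body,
                labels: ["nightly-failure"],
              });
            }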

@GeigerJ2
Contributor

I think it's fine to try this out now. If it becomes too noisy, we can always easily revert it. Being forced to quickly fix reliably failing tests is also good, I agree ^^
Though the other situation you described seems more annoying... it's still a mystery to me why flaky tests sometimes fail. Just when I fixed the failing du test in #6702, some test-amd64 job started failing for no reason, and re-running it (thanks @unkcpz) solved it. What are the reasons tests fail in such an unpredictable manner? I'm aware of, e.g., GHA runners sometimes being slow and tests timing out; does anything else come to mind? (just asking out of personal curiosity)

@unkcpz
Member

unkcpz commented Jan 15, 2025

Just when I fixed the failing du test in #6702, some test-amd64 started failing for no reason, and re-running by @unkcpz solved it.

The Docker tests can fail for many reasons, but the following are out of our hands and I think we can just rerun:

  • When we reach the rate limit on the Docker Hub registry. I think this happened last week for aiidateam because we run many CI actions: "the pull limit is 100 pulls per 21600 seconds (6 hours)" (https://docs.docker.com/docker-hub/download-rate-limit/).
  • Problems on GitHub's side. I think the failure we had yesterday after your PR was a GitHub problem, since other CI tests and the Read the Docs build failed for exactly the same reason.

For other failures, we need to take a look at the changes.
