Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Book test sometimes timeout (bad random seed?) #4493

Open
fingolfin opened this issue Jan 23, 2025 · 2 comments
Open

Book test sometimes timeout (bad random seed?) #4493

fingolfin opened this issue Jan 23, 2025 · 2 comments

Comments

@fingolfin
Copy link
Member

I noticed that sometimes our booktests timeout after 2.5 hours / 150 minutes, e.g. here for PR #4490 while usually they finish with lots of room to spare; e.g. here in 40 minutes.

In this case, the last message from the test runner in the logs before a gazillion messages of the form From worker 7: GC: pause 15.33ms. collected 52.249626MB. incr was this:

2025-01-22T23:08:56.8299319Z       From worker 7:	    vinberg_2.jlcon

Not having looked at when this filename is printed, I don't know if this indicates we are running the tests in vinberg_2.jlcon, or perhaps in the file after it.

But anyway, my suspicion is that there are some tests here which "usually" work but for some RNG seed states end up performing much worse than usual.

Perhaps we should reduce fluctuation in this test by forcing a specific seed? Or do we already set a seed -- then this might a sign of another source of randomness we do not yet control for (e.g. when Singular is forking to achieve parallelism, I am guessing the order in which child processes finish their task may affect the overall result)

Also it may be useful if someone dug into it to figure out which part causes the issue. Since we parse the .jlcon files to process them step-by-step, perhaps we could enable a debug mode where it prints out which line (number) it is currently processing?

@benlorenz
Copy link
Member

I have also noticed this, this has been happening for quite a while.

The filename is indeed printed directly before that jlcon is run.

We do set a fixed seed before every chapter:

run_repl_string(mockrepl, """Oscar.randseed!(42);""")

So in theory the runs should be deterministic. There are several chapters with random output where this does work as expected.

I can add a debug mode to print the executed lines.

@benlorenz
Copy link
Member

Directly in the run_repl_string we can only print the whole jlcon file but not each input line, since we basically paste the whole jlcon file to the repl (in bracketed paste mode).

I started working on a PR to help debugging this by adding a hangcheck timer: #4504

And we did in fact hit exactly the relevant issue in this run:
https://github.com/oscar-system/Oscar.jl/actions/runs/12988216014/job/36218721340?pr=4504

But the hangcheck triggered only once after about 10 minutes, at vinberg_2.jlcon line 153, julia> B = [basis_representation(Y2, D) for D in pullbackDivY1];.
It did not trigger at all after this, so I would assume that we are either in some tight loop or in non-julia code which does not allow the timer to run again (somewhere after that line 153...).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants