Book test sometimes timeout (bad random seed?) #4493

fingolfin · 2025-01-23T09:05:29Z

I noticed that sometimes our booktests timeout after 2.5 hours / 150 minutes, e.g. here for PR #4490 while usually they finish with lots of room to spare; e.g. here in 40 minutes.

In this case, the last message from the test runner in the logs before a gazillion messages of the form From worker 7: GC: pause 15.33ms. collected 52.249626MB. incr was this:

2025-01-22T23:08:56.8299319Z       From worker 7:	    vinberg_2.jlcon

Not having looked at when this filename is printed, I don't know if this indicates we are running the tests in vinberg_2.jlcon, or perhaps in the file after it.

But anyway, my suspicion is that there are some tests here which "usually" work but for some RNG seed states end up performing much worse than usual.

Perhaps we should reduce fluctuation in this test by forcing a specific seed? Or do we already set a seed -- then this might a sign of another source of randomness we do not yet control for (e.g. when Singular is forking to achieve parallelism, I am guessing the order in which child processes finish their task may affect the overall result)

Also it may be useful if someone dug into it to figure out which part causes the issue. Since we parse the .jlcon files to process them step-by-step, perhaps we could enable a debug mode where it prints out which line (number) it is currently processing?

The text was updated successfully, but these errors were encountered:

benlorenz · 2025-01-23T09:27:24Z

I have also noticed this, this has been happening for quite a while.

The filename is indeed printed directly before that jlcon is run.

We do set a fixed seed before every chapter:

Oscar.jl/test/book/test.jl

Line 255 in 0cfe803

run_repl_string(mockrepl, """Oscar.randseed!(42);""")

So in theory the runs should be deterministic. There are several chapters with random output where this does work as expected.

I can add a debug mode to print the executed lines.

benlorenz · 2025-01-27T15:46:04Z

Directly in the run_repl_string we can only print the whole jlcon file but not each input line, since we basically paste the whole jlcon file to the repl (in bracketed paste mode).

I started working on a PR to help debugging this by adding a hangcheck timer: #4504

And we did in fact hit exactly the relevant issue in this run:
https://github.com/oscar-system/Oscar.jl/actions/runs/12988216014/job/36218721340?pr=4504

But the hangcheck triggered only once after about 10 minutes, at vinberg_2.jlcon line 153, julia> B = [basis_representation(Y2, D) for D in pullbackDivY1];.
It did not trigger at all after this, so I would assume that we are either in some tight loop or in non-julia code which does not allow the timer to run again (somewhere after that line 153...).

benlorenz mentioned this issue Jan 27, 2025

booktest: add hangcheck timer to print current file+line, and later backtrace #4504

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Book test sometimes timeout (bad random seed?) #4493

Book test sometimes timeout (bad random seed?) #4493

fingolfin commented Jan 23, 2025

benlorenz commented Jan 23, 2025

benlorenz commented Jan 27, 2025

Book test sometimes timeout (bad random seed?) #4493

Book test sometimes timeout (bad random seed?) #4493

Comments

fingolfin commented Jan 23, 2025

benlorenz commented Jan 23, 2025

benlorenz commented Jan 27, 2025