-
-
Notifications
You must be signed in to change notification settings - Fork 105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
progress bars for EnsembleProblem #514
Conversation
Last time I checked none of the common loggers (the default logger in VSCode and TerminalLoggers) did summarize large numbers of progress bars. Everything would be displayed which lead to a terrible user experience IMO. |
Yea that's a fair criticism. I did it like this for a few reasons:
Maybe it'd be a good idea to contribute this log aggregation somewhere. |
I'd like to have better progress bars for ensemble simulations (not only for SciML but e.g. also in Turing where we currently only use a single progress bar for multi-chain sampling which is updated only when a full chain is sampled and hence doesn't either lead to a good user experience) but I think it's crucial to first add better support for such summaries, ideally not only in SciML but more generally for the most common loggers such as the one in VSCode or TerminalLoggers (see e.g. julia-vscode/julia-vscode#3297, and possibly also something like julia-vscode/julia-vscode#3317 and JuliaLogging/ProgressLogging.jl#23). |
A possible alternative approach here would be to make the ensembleprob's progress bar be purely based on the number of solutions in the ensemble that have finished. |
Yeah that's what we do in Turing currently. But the user experience is a bit suboptimal, in particular if you only sample a few chains in parallel - then it can happen that the progress bar is at 0 all the time and jumps to 1 when all chains are sampled at approx the same time. |
Codecov Report
@@ Coverage Diff @@
## master #514 +/- ##
==========================================
- Coverage 42.19% 41.88% -0.31%
==========================================
Files 53 53
Lines 4072 4121 +49
==========================================
+ Hits 1718 1726 +8
- Misses 2354 2395 +41
📣 Codecov offers a browser extension for seamless coverage viewing on GitHub. Try it in Chrome or Firefox today! |
I proposed for Cedar that we'd do the progress of completed simulations but indeed the granularity is quit low if you have a lot of cores or not that many problems. So IMO the correct solution is for the loggers to handle this better, or to aggregate the results internally like I suggested. |
It turns out that if you have many cores and you set |
FTR we're still investigating how to reduce overhead from progress logging, so some changes are still expected. |
Alright, performance overhead is now negligible using several strategies:
I've also added the requested weakdep on https://github.com/SciML/DiffEqBase.jl/releases/tag/v6.132.0 |
As long as https://github.com/SciML/DiffEqBase.jl/blob/master/test/downstream/inference.jl passes this should generally be good. |
I think that particular test seems to pass. Which of the other failures out of 70 test runs are my fault is hard for me to tell. |
98fee3b
to
8518c79
Compare
I rebased. We'll see how tests go. |
It looks like a backwards compat dispatch is missing: https://github.com/SciML/SciMLBase.jl/actions/runs/6479781336/job/17731842462?pr=512 |
Wait I'm confused what's going on here. I think my confusion cancelled out and maybe I did the right thing? Your comment seems to be about the stats PR, so I pushed an update there. Is there anything that needs doing for this PR? |
Can you show it on something simple like the Lorenz equation with SimpleTsit5? |
I'm looking at the docs and all I can find is how to define a rrule? https://fluxml.ai/Zygote.jl/latest/limitations/#Solutions-1
Ah! https://juliadiff.org/ChainRulesCore.jl/stable/api.html#ChainRulesCore.@ignore_derivatives |
9a79122
to
d4de969
Compare
Well that didn't work. What gives? https://github.com/SciML/SciMLBase.jl/actions/runs/6690536216/job/18176044626?pr=514#step:6:1470 |
Keeping track of all the PRs:
Well that's not what I was expecting.. @ChrisRackauckas what did you have in mind here? I think a simple solution would be to only pass down progress_id if progress=true so problems without progress support don't have to care about the kwargs |
FTR I'm getting weird precompile errors about the adjoint so maybe that's still not quite right?
|
I went back to the alternative logging workaround and did some more import fixes that were hidden by my repl state I guess. This passes the downstream AD tests for me, but I think it requires a new ChainRules release before it'll work on CI. |
ok new chainrules release is out, so restarting CI and it should pass |
this pr needs to bump the chainrules version requirement |
Last time Chris asked for a version requirement I added it and then he removed it again so idk |
I don't think this PR ever had a version requirement on ChainRules/ChainRulesCore? |
No I added one for DiffEqBase because I thought you wanted that to guarantee |
I don't see why that's related. @oscardssmith is saying you need to version bound on ChainRules/Core since you're implicitly relying on @oxinabox 's latest release for the Zygote fix on logging. |
a92e7cf
to
aebdb0c
Compare
Done. Just wanted to make sure. I'm still getting that weird DiffEqBaseZygoteExt precompile error btw |
Here are your performance numbers btw using OrdinaryDiffEq, BenchmarkTools
function lorenz(u, p, t)
σ = p[1]
ρ = p[2]
β = p[3]
du1 = σ * (u[2] - u[1])
du2 = u[1] * (ρ - u[3]) - u[2]
du3 = u[1] * u[2] - β * u[3]
return [du1, du2, du3]
end
u0 = [1.0f0; 0.0f0; 0.0f0]
tspan = (0.0f0, 10.0f0)
p = [10.0f0, 28.0f0, 8 / 3.0f0]
prob = ODEProblem{false}(lorenz, u0, tspan, p)
prob_func = (prob, i, repeat) -> remake(prob, p = rand(Float32, 3) .* p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
@benchmark sol = solve(monteprob, Tsit5(), EnsembleThreads(), trajectories = 10_000, saveat = 1.0f0, progress=false)
@benchmark sol = solve(monteprob, Tsit5(), EnsembleThreads(), trajectories = 10_000, saveat = 1.0f0, progress=true)
Logging output of 10k simulation runs after all these optimizations is just a hand full of lines
|
For reference, on master
thousands of log messages omitted for brevity ;) |
ODE->Ensemble optionally aggregate progress bars handle integer loglevel (sundials) avoid lock contention only log significant progress improve progress performance add version constraint Update Project.toml Update Project.toml ignore derivatives of logging more AD fixing attempts only pass progress_id if needed use ignore_deriviative and fix rrule for with_logger delete rules moved to ChainRules remove using import as opposed to using more import fixes more missing Logging qualifiers
aebdb0c
to
1d44a66
Compare
https://github.com/SciML/SciMLBase.jl/actions/runs/6795944602/job/18475031150?pr=514 looks like the last issue. |
This most likely got lost in a rebase
Ahhh I thought that must have been some CI glitch from the stats PR, but turns out that in one of the many rebase conflicts of this PR over its long lifetime the stats got lost. |
Seems fine now? Could it be. Could it really be done? |
Before#= ▆█▇▇▅▃▂ Memory estimate: 24.42 MiB, allocs estimate: 320057. BenchmarkTools.Trial: 89 samples with 1 evaluation. █▄ Memory estimate: 152.59 MiB, allocs estimate: 1190057. After#= ▇█▅▂▁ Memory estimate: 24.42 MiB, allocs estimate: 320076. BenchmarkTools.Trial: 31 samples with 1 evaluation.
▇▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇█▁▁▁▁▇▇▁▇▇█▁▇█▇▇█▁▇▇▇▇▇▁▁▁▇▇▁▇▁▇▇▁▁█ ▁ Memory estimate: 220.28 MiB, allocs estimate: 1785509. |
This is the first part of a series of PRs that adds progress support to EnsemblePoblem.
What this part does is if progress is enabled, initialize progress bars for every trajectory, and give every solve an unique ID.
For very large ensembles the expectation is that the logger shows this in a sensible way, for example, show the overall progress: