Samples can be lost if interrupted during add_live_points #490
Comments
Thanks for the report! One problem is that I have generally avoided touching the plotting code in the past (because it's not that well tested), so I'm less familiar with it.
This is being run through Bilby, and we have our own checkpointing method; it basically comes down to pickle-dumping the dynesty sampler object.
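For readers unfamiliar with this style of checkpointing, here is a minimal sketch of pickle-based checkpointing. It is not Bilby's actual code: the helper names dump_checkpoint and load_checkpoint are hypothetical, and the write-then-rename step is an assumption about what a careful implementation might do. It only protects against a half-written file; it does not make the pickled object itself consistent.

```python
import os
import pickle
import tempfile


def dump_checkpoint(obj, path):
    """Pickle obj to path without leaving a half-written file behind.

    Hypothetical helper for illustration only.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so that the final
    # rename stays on a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            pickle.dump(obj, handle)
        # Replace the old checkpoint in one step.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the partial file, then re-raise (this also covers
        # KeyboardInterrupt, which is not an Exception subclass).
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Load a previously pickled object from path."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

An interrupted dump then leaves the previous checkpoint file in place, but whatever state the object was in at dump time is what gets restored.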
Okay. Then it's a bit of a problem.
The main reason I haven't switched to using the new dynesty checkpoint system is that it doesn't allow on-demand checkpoints (presumably for this reason), such as when running on shared computing resources that may remove the job at random intervals. I would really appreciate it if you would consider the one-line change proposed.
I'll try to generate a MWE.
I will need to look at the change in more detail, as I need to be convinced that it's correct for dynesty on its own.
Here's an example that emulates being interrupted by a manual keyboard interrupt. Stopping during add_live is quite rare as it is fairly fast; in practice I'm seeing it happen for a few % of long-running analyses that get interrupted up to a few tens of times each, so maybe a little less than one percent of the time. Admittedly, in If an option to trace/run plots during a checkpoint (or on some other cadence) I think that could simplify how we use the We have a checkpoint-if-signaled method implemented, which is similar in spirit to the method below using
import sys
import dynesty
import numpy as np
from dynesty.plotting import runplot
# setup taken from one of the examples in the repository
rstate = np.random.default_rng(56101)
m_true = -0.9594
b_true = 4.294
f_true = 0.534
N = 50
x = np.sort(10 * rstate.uniform(size=N))
yerr = 0.1 + 0.5 * rstate.uniform(size=N)
y_true = m_true * x + b_true
y = y_true + np.abs(f_true * y_true) * rstate.normal(size=N)
y += yerr * rstate.normal(size=N)
def loglike(theta):
    m, b, lnf = theta
    model = m * x + b
    inv_sigma2 = 1.0 / (yerr**2 + model**2 * np.exp(2 * lnf))
    return -0.5 * (np.sum((y-model)**2 * inv_sigma2 - np.log(inv_sigma2)))
# prior transform
def prior_transform(utheta):
    um, ub, ulf = utheta
    m = 5.5 * um - 5.
    b = 10. * ub
    lnf = 11. * ulf - 10.
    return m, b, lnf
# Read a checkpoint if it exists as a test of the failure, try making a run plot
# we should see a warning if there is a failure
try:
    dsampler = dynesty.utils.restore_sampler("checkpoint.pkl")
    dres = dsampler.results
    print(dres["niter"], len(dres["logl"]))
    if dsampler.added_live:
        try:
            runplot(dres)
        except ValueError:
            pass
except FileNotFoundError:
    dsampler = dynesty.NestedSampler(
        loglike, prior_transform, ndim=3, bound='multi', sample='rwalk', rstate=rstate, nlive=1000)
try:
    for _ in range(10):
        if dsampler.added_live:
            dsampler._remove_live_points()
        dsampler.run_nested(checkpoint_file="checkpoint.pkl", dlogz=0.1, resume=True, maxiter=1000)
except KeyboardInterrupt:
    # If we've interrupted while adding live points, we land in the bug
    print(dsampler.added_live)
    dynesty.utils.save_sampler(dsampler, "checkpoint.pkl")
    sys.exit(1)
For testing, I added a sleep into the loop in
I think a change to call
Thank you for providing an example. Two points here
So I think what needs to be done is to understand how you can fit in with the existing approach that dynesty already uses (i.e. where a consistent state of the sampler can be obtained and saved at specific moments).
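One way to reconcile on-demand checkpoints with saving only at specific moments is to let the signal handler set a flag and act on it between chunks of work. The following is a minimal sketch under stated assumptions: CheckpointRequest and run_in_chunks are hypothetical names, and step and save are placeholders for whatever advances and stores the sampler. It assumes the sampler is in a consistent state whenever step returns.

```python
import signal


class CheckpointRequest:
    """Records that a signal arrived; the main loop acts on it later."""

    def __init__(self):
        self.requested = False

    def handle(self, signum, frame):
        # Only set a flag here; never touch the sampler mid-update.
        self.requested = True

    def install(self, signum=signal.SIGTERM):
        # Must be called from the main thread.
        signal.signal(signum, self.handle)


def run_in_chunks(step, save, request, max_chunks):
    """Call step() repeatedly and save only between chunks.

    Returns the number of chunks completed before saving.
    """
    for chunk in range(max_chunks):
        step()
        if request.requested:
            # A checkpoint was requested while step() was running;
            # the state is saved only now that step() has returned.
            save()
            return chunk + 1
    save()
    return max_chunks
```

In this pattern, step could wrap a short sampler run bounded by maxiter, as in the example above, so that the delay between the signal and the save is at most one chunk. Whether that delay is acceptable depends on how much warning the scheduler gives before removing the job.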
I understand the general concern and will adapt my use case to be more resilient to this issue. Personally, I think it is still worth addressing this specific issue with removing added live points, but feel free to close if you disagree.
I will close as I disagree with bilby-dev/bilby#872 that it is a bug.
Dynesty version
Initially noticed in 2.1.4 (installed via conda) and verified on master.
Describe the bug
When removing added live points, the last self.nlive points are removed. However, if the add_live_points method is interrupted early for some reason
Setup
The application I have where this is causing an issue is jobs being interrupted on a cluster scheduled with HTCondor and so reproducibility is difficult.
I've verified that I can create the issue by manually changing add_live_points to only yield a subset of the current live points.
In case it is relevant, I'm not using the built-in dynesty checkpointing, but a different system that predates it.
Dynesty output
The main thing users will see is the following warning being triggered if attempting to make a runplot after this.
Proposed solution
Since add_live_points doesn't increment self.it, I think it should be safe to just change the line linked above to be
I've tested this and it works fine (the runplot still complains).
It seems to work for the dynamic sampler, but I'm less familiar with that, so I'm not sure if there are additional subtleties.
Happy to open a PR if you're happy with this solution @segasai
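The failure mode described above can be illustrated with a toy model. This is a minimal sketch, not dynesty's code: ToySampler and its methods remove_last_nlive and remove_down_to_it are hypothetical, and truncating at self.it is only one reading of the proposal, since the exact one-line change is not shown here.

```python
class ToySampler:
    """Hypothetical stand-in that only tracks the list of saved samples."""

    def __init__(self, nlive):
        self.nlive = nlive
        self.it = 0           # number of real (dead-point) iterations
        self.saved = []       # dead points first, then any added live points
        self.added_live = False

    def iterate(self, n):
        for _ in range(n):
            self.saved.append(("dead", self.it))
            self.it += 1

    def add_live_points(self, interrupt_after=None):
        self.added_live = True
        for i in range(self.nlive):
            if interrupt_after is not None and i == interrupt_after:
                raise KeyboardInterrupt
            # Note that self.it is not incremented here.
            self.saved.append(("live", i))

    def remove_last_nlive(self):
        # Assumes all nlive points were appended before removal.
        del self.saved[-self.nlive:]
        self.added_live = False

    def remove_down_to_it(self):
        # Assumed alternative: keep exactly the first self.it entries.
        del self.saved[self.it:]
        self.added_live = False


def run(remove):
    """Interrupt add_live_points after 2 of 5 points, then clean up."""
    sampler = ToySampler(nlive=5)
    sampler.iterate(20)
    try:
        sampler.add_live_points(interrupt_after=2)
    except KeyboardInterrupt:
        pass
    getattr(sampler, remove)()
    return len(sampler.saved)


print(run("remove_last_nlive"), run("remove_down_to_it"))
```

Under these assumptions, removing the last nlive entries leaves only 17 of the 20 dead points, while truncating at self.it keeps all 20. Whether the same reasoning holds for the real sampler, and for the dynamic sampler in particular, would need to be checked against the dynesty source.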