Recover after error #257
Comments
I'm willing to think about how to deal with this, but my first instinct is that this will be surprisingly complex to implement. Two xyz outputs are basically out of the question - I think the best we'll be able to do is a single output with something like an info flag marking the failed configs.

Another issue is that it'd be somewhat complex to set up a system that can stop in the middle of a calculation (you'd have to run the calculation in a separate process). If we don't, you'd need to know at calculation N that doing the N+1st will take too long, and give up before starting it. If you can make that kind of estimate, you can probably also just set up your jobs not to exceed your walltime in the first place, I suspect. Again, doable, but not trivial. Can you break up your job into many smaller jobs (one per config, say)?

Maybe you should describe your use case more fully (number of configs, cell sizes, heterogeneity of sizes, range of times per calculation, etc.) to see if there's a way of dealing with it without adding this complexity.
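(For illustration, a minimal sketch of what that single marked output could enable; the `calc_failed` info key and the file names are hypothetical, not an existing wfl convention. A post-processing step could then split the one output into the two files originally requested:)

```python
from ase.io import read, write

# Split a single combined output into "successful" and "failed" files,
# assuming each failed config was tagged with a hypothetical
# atoms.info["calc_failed"] flag during the run.
configs = read("all_configs.xyz", index=":")

ok = [at for at in configs if not at.info.get("calc_failed", False)]
bad = [at for at in configs if at.info.get("calc_failed", False)]

write("successful.xyz", ok)
write("failed.xyz", bad)
```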
With some limitations, it may actually be more straightforward than I first thought. The actual timeout mechanism for the calculation is, I think (need to test the details), pretty straightforward (e.g. by running the calculation in a separate process, as mentioned above).
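(A minimal sketch of such a timeout mechanism, assuming the separate-process approach mentioned above; the timeout value, `run_one_dft`, and the result handling are all illustrative, not wfl's actual API:)

```python
import multiprocessing as mp

def run_one_dft(queue):
    # Placeholder for one DFT single point (e.g. a VASP run);
    # push the computed results back through the queue.
    queue.put({"energy": 0.0})

if __name__ == "__main__":
    TIMEOUT_S = 6 * 3600  # illustrative per-config budget

    queue = mp.Queue()
    proc = mp.Process(target=run_one_dft, args=(queue,))
    proc.start()
    proc.join(TIMEOUT_S)

    if proc.is_alive():
        # Calculation exceeded its budget: kill it and mark the config
        # as failed instead of letting the whole job hit the walltime.
        proc.terminate()
        proc.join()
        results = None
    else:
        results = queue.get()
```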
So I have to recalculate my training set with tighter DFT parameters (about 1000 configs). I'm looking at metal-oxide surfaces with between 80 and 200 atoms. A DFT single point normally takes between 1 and 3 hours on one node. I run 10 subprocesses on ten nodes, so 50 structures take about 15 hours of wall time in total. However, from time to time there are a few amorphous structures that need up to 10 hours for a single point. Our cluster has a 24 hour submission limit, and in one case every DFT was finished except for one complex config. I solved it in my case by parallelizing the single points within VASP over multiple nodes, but I think this would still be a good feature in terms of user-friendliness.
I will think about this myself. If you have something to test, I can try it out.
So by far the cleanest solution to this is to use the queuing system for what it's designed for, i.e. to run multiple jobs side by side or in sequence as resources become available, and the easiest way to do that is one queued job per configuration. Is that an option (perhaps we can make the time and/or number of nodes settable on a per-config basis), or do you have a problem such as a maliciously-configured queuing system with a small limit on the number of submitted jobs (some UK HPC - archer, I think)?
No, I can split it up into one job per DFT (the MPG HPC queuing system performs fine).
I think this may actually be pretty straightforward. I'll test my idea once I finish testing the kspacing thing.
We should maybe think more carefully in general about how we're handling failed jobs, etc., because as it stands now, even if you did this, you'd have to rerun the script with a larger walltime limit to get nice output - there'd be no clean way to just give up on the very slow job.
OK - on further thought, depending on what kind of timeouts you want to deal with, this could be quite messy. I think this needs a deep discussion first, and going through the possible use cases very carefully. I'm happy to do it, but I won't delay my other PR (the DFT calculator kspacing and non-periodic cell update) for it. Also, I have a question:
What exactly do you mean by this statement? Are you trying to do pool-based parallelism with 10 subprocesses in a job that uses 10 nodes? Can you describe more precisely how you are setting up the wfl script that's doing the database re-evaluation to try to achieve this? I ask because I'm not sure the normal pool-based mechanism will spread the work across the nodes the way you expect.
So in my case the multiprocessing pool is only sending the VASP run commands, waiting, and retrieving the results in an efficient way.
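(Roughly like the following sketch; the `srun` flags, directory names, and pool size are assumptions about the setup described above, not a verified recipe:)

```python
import multiprocessing as mp
import subprocess

def run_vasp(workdir):
    # Launch one VASP single point on its own node via the queuing
    # system; the exact srun flags depend on the cluster setup.
    subprocess.run(
        ["srun", "--nodes=1", "--exclusive", "vasp_std"],
        cwd=workdir,
        check=True,
    )
    return workdir

if __name__ == "__main__":
    workdirs = [f"config_{i:04d}" for i in range(50)]
    # Keep 10 single points in flight at a time on a 10-node allocation;
    # the pool only dispatches and waits, the real work happens in VASP.
    with mp.Pool(processes=10) as pool:
        for done in pool.imap_unordered(run_vasp, workdirs):
            print("finished", done)
```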
Well, it'll try to run 10 at the same time, so the question is: if you have a 10 node allocation (which was my understanding) and you run 10 calculations at once, do they all just end up running on the same node? [added: I'm fairly certain that if you used plain … they would.]
No, I checked that: slurm gives me 96-99% CPU utilization back, and I also did a benchmark to check that I get the expected time reduction. It also works if I do 2 subprocesses with …. So I'm confident that everything works as intended.
Weird - definitely doesn't seem to work on our slurm system. If I request an 8 node allocation and run 8 …
When I run large numbers of DFT calculations through wfl on the cluster, I run into the problem that the job hits the wall-time limit and fails. In that case I have to go through the calculation directories and recover the calculations manually.
Would it be possible to add a feature where I can give wfl an internal wall-time, so that it stops after e.g. 20 h and returns one xyz with the successful calculations and another xyz with the failed/unfinished ones?
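(A sketch of the requested behavior, using the simple "stop before starting the next calculation" variant discussed above; `evaluate`, the file names, and the time budget are illustrative:)

```python
import time
from ase.io import read, write

WALLTIME_S = 20 * 3600  # requested internal wall-time budget

def evaluate(atoms):
    # Placeholder for one DFT single point; return True on success.
    return True

configs = read("trainingset.xyz", index=":")
start = time.time()
ok, unfinished = [], []

for atoms in configs:
    # Once the budget is spent, stop starting new calculations and
    # collect everything not yet evaluated as "unfinished".
    if time.time() - start > WALLTIME_S:
        unfinished.append(atoms)
        continue
    (ok if evaluate(atoms) else unfinished).append(atoms)

write("successful.xyz", ok)
write("unfinished.xyz", unfinished)
```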