-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updates to use pyaestro to back a local pool style adapter. #345
base: develop
Are you sure you want to change the base?
Conversation
@jwhite242 -- Some considerations based on why we were discussing earlier:
|
Current todo: Cancellation works at the Maestro level, but leaves the pool executing the last processes that were running before the cancellation was posted. This bug boils up to a pyaestro bug. |
""" | ||
return f"#!{self._exec}" | ||
|
||
def get_parallelize_command(self, procs, nodes=None, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we generalize this to execute or launch command rather than making parallel special? More just about a hook for a custom launch prefix/suffix/wrapper/etc than parallel in particular right? One scheduler thing i can see used is big memory jobs on hpc might request a whole node just for the memory, but only run serial (e.g. srun -N1 -n1..).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah -- I want to redesign this in pyaestro
and then use it here.
identifier. | ||
""" | ||
try: | ||
job_id = self._pool.submit(path, cwd) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there a future being tracked in here somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pool is entirely backed by pyaestro
-- which does implement this with a future.
return JobStatusCode.ERROR, {} | ||
|
||
status = {jid: State.UNKNOWN for jid in joblist} | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better to put this inside the loop so it can actually get through the whole list and catch all the exceptions? Also, what exception did this get added to catch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did this outside the loop since it's a basic operation -- also because the Enum
is not a constructor, I couldn't use defaultdict
like I wanted. I plan to explore making a meta-enum to handle the default case to make this more Pythonic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In regards to the except
-- it's there to log any possible errors that do come up. It's not necessarily meant to take corrective action.
|
||
c_return = CancelCode.OK | ||
ret_code = 0 | ||
if ExecCancel.FAILED in c_status: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this check all of them, i.e. counting the number of cancels that failed and ones that succeeded separately for more informative error messages?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This design choice is a decision carried forward from SLURM (when I initially made this); in that perspective, the status is reported if at least one record reports the state. The issue with things like SLURM is that it's more efficient to do a bulk cancel, so there's no way to get a per job report out there (not without polling the scheduler to see if it took).
elif executor_state == ExecTaskState.INITIALIZED: | ||
return State.INITIALIZED | ||
elif executor_state == ExecTaskState.CANCELLED: | ||
return State.CANCELLED |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about a CLOSING or CANCELLING status? I've noticed at least on slurm that if you check it again too quick after cancelling (sometimes up to a minute) you'll get the CG status from the scheduler while it's still cleaning up/shutting it down. Or can pending capture this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the current implementation, those don't exist. The pool technically will cancel a job (in this case a process) outright.
batch: | ||
shell: /bin/bash | ||
type: local_pool | ||
num_workers: 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we'll definitely need to spend some effort on whatever nomenclature is settled on for this stuff to avoid confusion with workers/cores/tasks/jobs/...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I rename this to num_processes
instead? I feel like that would be a better description that doesn't alias to other things.
So I was thinking this could be a useful distinction based on the two possible run modes: standalone batch jobs (the current scheduler adapters) and running maestro inside of an allocation manually for job packing (or spawned by another tool built on top of maestro). Doesn't necessarily need to use local as the distinction, but do we need some hook to enable the latter job packing mode? As for the job packing, this is something where i think plugins would be really useful, or some way for users to write these things like with pgen. The number of scheduling behaviors and the optimization algorithms for implementing them is a pretty large space that doesn't necessarily need to be hard wired into maestro. A few simple/interesting ones would be good of course. |
I was thinking about this and the case of running Maestro in an allocation would be where you have an "allocation adapter" class that comes in handy. That class would take in the global set of resources requested and schedule the conductor call. That's actually where I wanted to split out the MPI related functionality from the For this PR, I was thinking if I can get the local pool as the main local adapter (it would still return that it's local, but in the I'm starting to wonder if this is a good time to introduce an Hopefully this is making sense. |
This PR is a preliminary step towards a more "ensemble"-ish adapter related to the functionality discussed in #330.
This new
LocalPoolAdapter
utilizes pyaestro'sExecutor
class in order to fake a scheduler. Some potential future additions are timeouts and resource support for the above #330.