RuntimeDiscussion_20180718
- Dialup Info: (Do not post to public mailing list or public wiki)
- Geoff Paulsen
- Josh Hursey
- Ralph Castain
- Geoffroy Vallee
- Todd Kordenbrock
- Shinji Sumimoto
- Takahiro Kawashima
- Maurali (LLNL)
- Two Options:
- Keep going on our current path, taking updates to ORTE, etc.
- Two problems:
- OPAL abstraction layer: every time you want to expose a new PMIx function, you have to do it three times:
- (1) in PMIx, (2) in the OPAL abstraction layer, and (3) in ORTE itself.
- This is a problem both because of the redundant work and because it invites bugs.
- Potential solution: redo the OPAL abstraction layer and use PMIx as the internal layer in OMPI itself.
- Would have to figure out how to write a SLURM PMI1 or PMI2 interface.
- Could call the PMIx API and convert to the PMI1 or PMI2 protocol for SLURM or ALPS.
- Eventually this will go away, as SLURM and ALPS will implement the PMIx APIs and won't need PMI1 or PMI2 layers.
- Could say that with Open MPI v5.0 we'll only supply a PMIx API, and those who need PMI1 or PMI2 can stay at Open MPI v4.0.
- Need to see how hard of a line we might take.
- SLURM already has a PMIx implementation, but older SLURM versions will be the issue.
- At the moment, Cray doesn't yet have a PMIx version of ALPS.
- Tools - PMI1 and PMI2 don't have tools interfaces.
- MPIR - if Open MPI chooses not to remove it in v5.0.
- Orthogonal to the OPAL abstraction layer issue.
- Touches the ORTE and OMPI layers; partially broken right now.
- Historically we don't worry, and someone will fix the bugs.
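
The conversion idea discussed under the first option (callers use a PMIx-style API, and a thin component translates to a PMI1 or PMI2 backend underneath) can be sketched roughly as follows. This is only an illustration under stated assumptions: `legacy_kvs_put`, `legacy_kvs_get`, `shim_put_uint`, and `shim_get_uint` are hypothetical names, and the in-memory table merely stands in for a string-only PMI-style key-value store. None of this is the real PMIx, PMI1, or PMI2 API.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a legacy, string-only key-value store
 * (the role PMI1/PMI2 would play). Not a real PMI API. */
#define LEGACY_MAX 16
#define LEGACY_LEN 64
static char legacy_keys[LEGACY_MAX][LEGACY_LEN];
static char legacy_vals[LEGACY_MAX][LEGACY_LEN];
static int legacy_count = 0;

static int legacy_kvs_put(const char *key, const char *val)
{
    if (legacy_count >= LEGACY_MAX ||
        strlen(key) >= LEGACY_LEN || strlen(val) >= LEGACY_LEN) {
        return -1;
    }
    strcpy(legacy_keys[legacy_count], key);
    strcpy(legacy_vals[legacy_count], val);
    legacy_count++;
    return 0;
}

static int legacy_kvs_get(const char *key, char *val, size_t len)
{
    /* Search newest entries first so a later put overrides an earlier one. */
    for (int i = legacy_count - 1; i >= 0; i--) {
        if (strcmp(legacy_keys[i], key) == 0) {
            snprintf(val, len, "%s", legacy_vals[i]);
            return 0;
        }
    }
    return -1;
}

/* Hypothetical typed front end: the caller works with typed values,
 * and the shim encodes them as strings for the legacy store. */
int shim_put_uint(const char *key, unsigned int value)
{
    char buf[LEGACY_LEN];
    snprintf(buf, sizeof(buf), "%u", value);
    return legacy_kvs_put(key, buf);
}

int shim_get_uint(const char *key, unsigned int *value)
{
    char buf[LEGACY_LEN];
    if (legacy_kvs_get(key, buf, sizeof(buf)) != 0) {
        return -1;
    }
    *value = (unsigned int)strtoul(buf, NULL, 10);
    return 0;
}
```

The point of the sketch is that the upper layers only see the typed front end, so the translation component could be dropped once SLURM and ALPS speak PMIx natively.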
- Shuffle our code a bit (new ompi_rte framework merged with the orte_pmix framework, moved down and renamed).
- OPAL used to be a single-process abstraction, but that is not as true anymore.
- The API of foo looks pretty much like the PMIx API.
- Still have PMIx v2.0, PMI2, or other components (all retooled for the new framework to use PMIx).
- To call, just call opal_foo.spawn(), etc.; then you get whatever component is underneath.
- What about mpirun? PRTE comes in here; it's the server side of the PMIx stuff.
- Could use their prun and wrap it in a new mpirun wrapper.
- PRTE doesn't just replace ORTE. PRTE and the OMPI layer don't really interact with each other; they both call the same OPAL layer (which contains PMIx and other OPAL stuff).
- prun has a lam-boot looking approach.
- Build system about OPAL, etc. Code shuffling, retooling of components.
- We want to leverage the work the PMIx community is doing correctly.
- ORNL OSHMEM - having a similar discussion, so this approach should work for OSHMEM as well.
- ORTED - goes through the OPAL abstraction as well.
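
The "call opal_foo.spawn() and get whatever component is underneath" pattern from the second option can be illustrated with a small function-pointer module table. This is a minimal sketch, not OPAL's actual framework code: `foo_module_t`, `foo_select`, `foo_spawn`, and the two stub components are hypothetical, and the return codes 100 and 200 exist only to show which component handled the call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical module interface; names are illustrative, not OPAL's. */
typedef struct {
    const char *name;
    int (*spawn)(const char *cmd);   /* returns a component-specific code */
} foo_module_t;

/* Stub components standing in for PMIx-backed and PMI2-backed code. */
static int pmix_like_spawn(const char *cmd) { return cmd ? 100 : -1; }
static int pmi2_like_spawn(const char *cmd) { return cmd ? 200 : -1; }

static const foo_module_t foo_components[] = {
    { "pmix", pmix_like_spawn },
    { "pmi2", pmi2_like_spawn },
};

/* The currently selected module; callers only ever go through this. */
static const foo_module_t *opal_foo = NULL;

/* Select a component by name; returns 0 on success, -1 if unknown. */
int foo_select(const char *name)
{
    size_t n = sizeof(foo_components) / sizeof(foo_components[0]);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(foo_components[i].name, name) == 0) {
            opal_foo = &foo_components[i];
            return 0;
        }
    }
    return -1;
}

/* Caller-facing entry point: dispatches to whatever was selected. */
int foo_spawn(const char *cmd)
{
    return opal_foo ? opal_foo->spawn(cmd) : -1;
}
```

Under this shape, the calling code stays the same whether the PMIx or a retooled PMI2 component is selected underneath.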
- PRTE - Third approach looks like lam-boot: simply move from being inside OMPI to being inside of PRTE.
- The only way this makes sense is if there is a more active community working on PRTE.
- Any hope on this becoming true? - Not really, we'd be surprised.
- Thought that when resource managers adopted
- OSHMEM community needs to have a solution. Right now extract ORTE from Open MPI.
- OSHMEM is interested in having its own prted for its launching.
- Thought some resources were becoming available, but a bit confusing now.
- A slightly different question - separating the runtime project from Open MPI, either PRTE or ORTE.
- One benefit of using a separate runtime project is that it's easier to integrate.
- Like the idea of pulling the runtime away from Open MPI as a separate project.
- Then the runtime itself can follow its own path and its own release cycle.
- Then Open MPI can pick a version of the runtime based on quality requirements.
- Having this separate project be PRTE has some advantages.
- Fujitsu - process manager - currently implementing and debugging PMIx in their resource manager.
- Does Open MPI want a launcher at all?
- It used to be like this with lamboot. Users would boot something, and then
- In this path, Would say that Open MPI doesn't have a resource manager (might package PRTE).
- Other path is we ARE going to have a runtime, but who's going to have it.
- Right now, because the runtime is integrated in Open MPI, everyone has to work within this context.
- If we split the two completely,
- ORTE had to adjust for direct launch for SLURM and other direct launchers.
- Three big questions:
- Should OMPI and OPAL move to using PMIx directly (without the OPAL abstraction layer)?
- Internal code reordering; if done correctly, it'd be transparent.
- Actually rather simple. OPAL modex send/recv macros: literally copy those from PRTE and put them into a header in OMPI or OPAL.
- Already done in PRTE.
- At some point, PMI1 and PMI2 conversion components - some users might see this pain.
- Any reason NOT to do this??? - The PMI1 and PMI2 components don't have owners.
- Can define this work.
- Do we have Open MPI contain ORTE as today, or pull it out into a separate product (separate release cycles, etc.)?
- How to make progress on this question???
- What do we gain by doing this?
- Life is easier for those who don't need the runtime.
- Life is easier for those who don't need MPI.
- Customers can update the runtime independently from Open MPI releases. (This has been helpful for other launchers.)
- Could have its own quality requirements for release.
- Would like to have separate runtime tests.
- This is the main decision.
- How do we get the stake holders to the meeting???
- Let's have another meeting like this in a month?
- How can we get a credible answer to "What's the path forward?"
- Nobody has any resources to put on it. No matter what we decide no one can do it.
- Need a clear decision from the community.
- Do we need statements of intent?
- Take ORTE out, and need a 3rd party launcher in some env.
- Leave ORTE in, and people have to step up and
- Do we have everyone call PMIx directly? Burden on non PMIx envs.
- If we do separate it out, what (if anything) do we make the default?
- Delay until we can answer #2.
We've got 3 big questions, how do we make progress?
Chicken-and-egg problem: people don't see the priority yet because they don't feel the pain yet.
- One solution is to "expose the pain" in small increments.
- ECP - exa-scale project for Labs.
- If we do this, we still need people to do runtime work over in PRTE.
- In some ways it might be harder to get resources from management for yet another project.
- Nice to have a componentized interface, without moving runtime to a 3rd party project.
- Need to think about it.
- Concerns with the work of adding ORTE PMIx integration.
- Want to know the state of the SLURM PMIx plugin with PMIx v3.x.
- It should build and work with v3. They only implemented about 5 interfaces, and those haven't changed.
- A few related to the OMPIx project, talking about how much to contribute to this effort.
- How to factor in the requirements of OSHMEM (who use our runtimes) and are already doing things to adapt.
- Would be nice to support both groups with a straightforward component to handle both of these.
- Thinking about how much effort this will be, and how to manage these tasks in a timely manner.
- Testing: will need to discuss how to best test all of this.
Today (Geoffroy Vallee)
- Let's take a stance and let the community react?
- Move the runtime outside of the Open MPI tree, into its own project.
- Runtime would have its own release schedule and meetings.
- Could drop an initial release right away.
- Switch our code to use PMIx directly and not opal abstraction layer.
- If people still want a way to start jobs, they either download a 3rd party package, or as a community we provide a packaged version of the software that gives everything at once.
- Could be packaged as 2 rpms (one with RTE, and one without RTE)
- Push this out there as the direction we're thinking of going, and let the community respond with concerns.
- Could even call the runtime ORTE when we move it out, if we use language carefully.
- Need to discuss with packagers after the community has come to consensus.
Geoffroy Vallee will send this writeup to devel-core by the same time next week. Follow-up meeting two weeks from now, same time.