WeeklyTelcon_20200317
Geoffrey Paulsen edited this page Mar 17, 2020 · 1 revision
- Dialup Info: (Do not post to public mailing list or public wiki)
- Geoffrey Paulsen (IBM)
- Jeff Squyres (Cisco)
- Austen Lauria (IBM)
- Akshay Venkatesh (NVIDIA)
- Brian Barrett (AWS)
- Brendan Cunningham (Intel)
- David Bernholdt (ORNL)
- Harumi Kuno (HPE)
- Howard Pritchard (LANL)
- Joseph Schuchart
- Josh Hursey (IBM)
- Joshua Ladd (Mellanox)
- Michael Heinz (Intel)
- Thomas Naughton (ORNL)
- Noah Evans (Sandia)
- Ralph Castain (Intel)
- Scott Breyer (Sandia?)
- Todd Kordenbrock (Sandia)
- William Zhang (AWS)
- Artem Polyakov (Mellanox)
- Edgar Gabriel (UH)
- Nathan Hjelm (Google)
- Charles Shereda (LLNL)
- George Bosilca (UTK)
- Matthew Dosanjh (Sandia)
- Brandon Yates (Intel)
- Erik Zeiske
- Mark Allen (IBM)
- Matias Cabral (Intel)
- Xin Zhao (Mellanox)
- mohan (AWS)
-
MTT -
- If you change your MTT setup to start PRRTE at the beginning of the session and then just use prun, run times can be cut in half or more.
- This is good, but we also need to test the mpirun wrapper.
- Cisco is converting half of its MPI installs to use PRRTE/prun.
-
OMPI master submodule pointers are set up to track PMIx and PRRTE master.
- Jeff discussed an idea for PRRTE integration: putting a specific string in a PRRTE PR would automatically open an Open MPI PR to update the PRRTE submodule once that PRRTE PR is merged to PRRTE master.
Blockers All Open Blockers
Review v3.0.x Milestones v3.0.6
Review v3.1.x Milestones v3.1.6
- Brian merged a one-sided shared-memory fix last night to kick off MTT.
- Assuming everything looks good, will do the release today.
- Some questions about MTT running on v3.0.x and v3.1.x
Review v4.0.x Milestones v4.0.4
- v4.0.4 in the works.
- No Schedule yet.
- Jeff is working with Ralph on a PMIx issue involving dstore.
- May need a new PMIx v3.1.x release.
- It's a bug, but it may not show up unless you direct launch.
- Issue 7507
- There's a one-line code fix needed for SLURM with more than 64 processes per node (ppn > 64).
- This may drive a v4.0.4
- Right in MPI_Init (not related to specific component)
- Schedule:
- Feature Freeze: End of April
- Release: End of June
- Austen took an initial stab at the issues and is starting a Google Sheets document of v5.0 features.
- Today we went through all of the items in the Google Sheets document (https://docs.google.com/spreadsheets/d/1OXxoxT9P_YLtepHg6vsW3-vp4pdzGQgyknNbkzenYvw/edit#gid=0), which were taken from the face-to-face wiki.
- Josh Ladd led us to gather owners and a status for each of the various tasks. Not all were in attendance so we did the best we could. We can update after we get more information.
- PMIx v4.0.0
- Totalview says it's on track.
- PRRTE v2.0
- Steadily making progress. Other than Comm_Spawn, just a few more little things.
- Remove OSC pt2pt - Not straightforward.
- SUMMARY: Significant technical investigation needed.
- Intel will see about a path forward.
- If we remove this, Omni-Path won't have an OSC component.
- Timeframe is end of April.
- Michael works on the Omni-Path team.
- It does not work for multithreaded use.
- It can crash quite a bit.
- It may have a data corruption issue, but this hasn't been investigated deeply.
- No Issue opened.
- Nathan suggested removing this.
- Not even a good reference implementation.
- TCP can use OSC_RDMA over the tcp btl.
- Need to do testing with OSC UCX.
- Mellanox UCX is as good as UCT, and more supportable.
- Relies on UCT, so it needs to be hardened over time.
- OFI btl works
- But PSM/PSM2 are problematic.
- Mixing PSM2 MTL, and OSC_OFI_BTL is a problem.
- But not building the PSM2 MTL helps.
- Intel still wants to support the PSM2 MTL, as CUDA support in OFI isn't as performant.
- SLURM PMIx plugin has been locked on PMIx v2 for some time.
- There are some NEW PMIx calls that SHOULD be added to bring it up to date.
- Ralph has started a PR, but needs help.
- So for now, there's some optional info that won't be passed correctly.
- No OMPI_INFO for now.
- Ralph gets pinged occasionally.
- Not sure of the priority of this.
- MTT on master is looking pretty good.
- Deferred.
- Scale testing: PRs have to opt into it.
Review Master Master Pull Requests
- CI testing only checks that the build succeeds and that the tests ran; it doesn't test HOW they ran.
- Environment setup can be a bit different.
- For example, no permissions in /tmp: a test might pass on one machine and fail on another without /tmp permissions.
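The /tmp point above could be caught up front with a small environment probe. The following is only a minimal sketch, not part of the Open MPI CI; the function name `tmp_is_usable` is hypothetical, and it assumes that being able to create and write a scratch file is a reasonable proxy for "has /tmp permissions".

```python
import tempfile


def tmp_is_usable(path="/tmp"):
    """Return True if a scratch file can be created and written in `path`.

    Hypothetical helper for illustration only; not part of Open MPI or MTT.
    """
    try:
        # NamedTemporaryFile creates the file in `path` and deletes it on close.
        with tempfile.NamedTemporaryFile(dir=path) as handle:
            handle.write(b"probe")
            handle.flush()
        return True
    except OSError:
        # Covers a missing directory, permission errors, and read-only mounts.
        return False


if __name__ == "__main__":
    print(f"/tmp usable: {tmp_is_usable()}")
```

Running a check like this at the start of a CI job would at least record the environment difference in the log, so that a failure on one machine can be told apart from a real regression.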