WeeklyTelcon_20181211
Geoffrey Paulsen edited this page Jan 15, 2019
·
1 revision
- Dialup Info: (Do not post to public mailing list or public wiki)
- Geoff Paulsen
- Jeff Squyres
- Brian Barrett
- Ralph Castain
- Dan Topa (LANL)
- Edgar Gabriel
- Howard Pritchard
- Thomas Naughton
- Todd Kordenbrock
- Xin Zhao
- Nathan Hjelm
- Aravind Gopalakrishnan (Intel)
- Josh Hursey
- Matias Cabral
- Akshay Venkatesh (nVidia)
- David Bernholdt
- Geoffroy Vallee
- Joshua Ladd
- Matthew Dosanjh
- Arm (UTK)
- George
- Peter Gottesman (Cisco)
- mohan
- Summary of PMIx re-architecting for v5.0
- Lots of TCP wire-up discussion
- Session work is mostly done. Ready mid-January.
- works with MPI_Init.
- Involved a lot of cleanup for setup and shutdown.
- Can keep it as prototype, or put it in, without headers.
- For MPI_Init/MPI_Finalize only apps, fully backward compatible.
- Initialize a "default" Session.
- Asking about adding this to master in mid-January
- Part of cleanup is to have reverse setup and shutdown.
- Cleanup sounds good. Well contained. Set of patches.
- Calling it "instances" inside of MPI, but we'll be renaming it if/when MPI standardizes sessions.
- Summary: for the cleanup patches, let's do them and look at them.
- Under work for sessions, need to look at a bit closer
- We can discuss sessions bindings in the future.
- Session init is all local, so timing should still be good.
Review All Open Blockers
- Schedule: posted a v2.1.6 rc1
- Driver: Assembly and locking fix, vader and pmix, etc.
- Nathan will check to ensure all atomic fixes are in that branch.
- Issue 5932: not all atomic fixes are correctly in the v2.1.x branch yet
- still happening in v2.1.6 rc1
- Something missing on v2.1.x branch
- May get ready to finish it, but not release it until January (since we're all going away).
Review v3.0.x Milestones v3.0.3
- Schedule:
- v3.0.4 is scheduled for May 2019
- PMIx 2.2 will be available next week
Review v3.1.x Milestones v3.1.0
- Schedule:
- v3.1.4 is scheduled for April 2019
- New PMIx available next week
Review v4.0.x Milestones v4.0.1
- Schedule: Need a quick turn around for a v4.0.1
- v4.0.0 - a few major issues:
- mpi.h is correct, but the library is not building the removed and deprecated functions because they're missing in Makefile.
- Fix in https://github.com/open-mpi/ompi/pull/6120
- Issue 6149 - Tests are fine, but needs PR6120
- Two issues hit via Spack packaging:
- root cause may be: make -j creates TOO many threads of parallel execution on some OSes.
- max filename length restriction on Fortran header files.
- PR6121 master - should resolve on v4.0.x
- Discuss pulling PR 6110 into v4.0.1
- Bug, some OSHMEM APIs missed in v4.0.0
- Jeff pulled up slides showing that we can ADD APIs in minor versions.
- Executables built against an older release must be able to run with a newer one.
- We need to verify if the patch breaks anything with older built executables.
- Because this PR is just adding functions, it should be okay.
- Mellanox volunteered to test an executable built with the old release running against the newer OMPI
- If that test passes, everyone is okay with pulling this in.
- UCX priority PR - expecting a PR from master
- Matias Cabral: local procs with OFI MTL. The master PR is okay and will be coming back to v4.0.x (6106)
- Two rankfile mapper issues reported on mailing list. Howard will file issue.
- Need to create v4.0.x issues for https://www.mail-archive.com/[email protected]/msg32847.html
- @siegmargross
- IBM MTT nightly Fortran
- IBM PGI compiler license expired.
- Libtool issue came up before or during Supercomputing.
- this goes back to v3.0 or v3.1 (can't remember what user was actually using).
- We made a backwards incompatible change to opal (not part of our ABI)
- When we bumped the version numbers in libtool, we bumped them so that you couldn't use an old libopal with a new libmpi, on the basis that apps should only link against libmpi, so it shouldn't matter.
- We had a user complaining that it was failing due to link errors. After a bit of digging, his app turned out to be linking against the HDF5 library, which is linked with libtool, which does a secondary inspection and links against libopal.
- Not really an HDF5 bug, it's a libtool issue.
- Literally nothing we can do for v3.0.3 (or nothing we can do for v3.0.x)
- Probably want to figure out what to do here.
- Option 1: Stop installing libtool `.la` files.
- Would actually be "gross"; have to talk to package managers, they have strong feelings.
- Option 2: Start treating those libraries as part of our ABI guarantee.
- Option 3: Someone's flavor of libtool has a patch so that they don't include the dependent library in the `.la` files.
- Jeff and Brian will look at the patch, and inquire upstream with libtool
- 2015 was last time libtool had an active release.
- Don't know if there's much active libtool development anyway.
- Need to feel out the libtool community about this.
- Update: There was a patch, but it caused other side-effects.
- Conclusion is we'll probably have to version all libraries.
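
As a rough illustration of what "versioning all libraries" involves, here is a minimal sketch of the standard libtool `-version-info current:revision:age` update rules. The `VersionInfo` and `bump` names and the starting triple `40:0:0` are hypothetical, chosen only for the example; they are not Open MPI's actual values or code.

```python
from typing import NamedTuple


class VersionInfo(NamedTuple):
    """A libtool -version-info triple (current:revision:age)."""
    current: int
    revision: int
    age: int

    def supports(self, interface: int) -> bool:
        # A library supports interface numbers current-age .. current.
        return self.current - self.age <= interface <= self.current


def bump(v: VersionInfo, change: str) -> VersionInfo:
    """Apply the libtool update rules for one kind of change (hypothetical helper)."""
    if change == "source-only":
        # Code changed, interfaces untouched: only the revision moves.
        return VersionInfo(v.current, v.revision + 1, v.age)
    if change == "added":
        # Interfaces added: new current, and older clients still work (age grows).
        return VersionInfo(v.current + 1, 0, v.age + 1)
    if change == "incompatible":
        # Interfaces removed or changed: older clients are cut off (age resets).
        return VersionInfo(v.current + 1, 0, 0)
    raise ValueError(f"unknown change kind: {change}")


# Hypothetical starting point: an app was linked against interface 40.
old = VersionInfo(40, 0, 0)
print(bump(old, "added").supports(40))         # adding functions keeps old apps working
print(bump(old, "incompatible").supports(40))  # an incompatible change does not
```

Under these rules, a change that only adds functions keeps older executables loadable, while an incompatible change to a library such as libopal cuts off anything linked against the older interface.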
- Maybe it's time to discuss bringing opal and ompi back into one library, so that we only version that instead of all libs.
- May be a bit tricky, since for Python these libs now link again to top level library.
- How will this be affected by replacing orte with prte?
- Prte is separate, so we don't really have orte.
- Compile in all components separately and link in the final step
- We no longer have a separate runtime to break out, so no reason to do this additional work (to try to break runtime out)
- Please think about this.
- Still lots of golden balls on PRs due to Amazon AWS / Jenkins
- Looks like the problem is in Jenkins (deadlocking on itself); the web interface is still up. None of the instances spin down, etc. Need to go find the Jenkins bug report and see if they've made progress.
- Jenkins still has not fixed this issue, so we can't use AWS.
- Update: Jenkins server is just dead.
- What do we do about all of these Master PRs?
- We don't have a release off of master soon.
- New PRs won't go yellow-ball because they don't spawn EC2 tasks (theory)
- Will still run libfabric and some other tests.
- Releasing a new version at end of week or next week.
- IBM test configure should have caused that.
- Cisco has a one-sided info check that failed a hundred times.
- Cisco install fail looks like a legit compile fail (ipv6 master)
- We have a new ibm-ompi Slack channel for Open MPI developers.
- Not for users, just developers...
- Email Jeff if you're interested in being added.
Review Master Pull Requests
- didn't discuss today.
Review Master MTT testing
- Mellanox, Sandia, Intel
- LANL, Houston, IBM, Fujitsu
- Amazon
- Cisco, ORNL, UTK, NVIDIA