WeeklyTelcon_20211214
Geoffrey Paulsen edited this page Jan 4, 2022 · 1 revision
- Dialup Info: (Do not post to public mailing list or public wiki)
- Geoffrey Paulsen (IBM)
- Jeff Squyres (Cisco)
- Austen Lauria (IBM)
- Brendan Cunningham (Cornelis Networks)
- Brian Barrett (AWS)
- David Bernholdt (ORNL)
- Edgar Gabriel (UH)
- Harumi Kuno (HPE)
- Hessam Mirsadeghi (UCX/nVidia)
- William Zhang (AWS)
- Christoph Niethammer (HLRS)
- Akshay Venkatesh (NVIDIA)
- Artem Polyakov (nVidia/Mellanox)
- Aurelien Bouteiller (UTK)
- Brandon Yates (Intel)
- Charles Shereda (LLNL)
- Erik Zeiske
- Geoffroy Vallee (ARM)
- George Bosilca (UTK)
- Howard Pritchard (LANL)
- Joseph Schuchart
- Josh Hursey (IBM)
- Joshua Ladd (nVidia/Mellanox)
- Marisa Roman (Cornelis Networks)
- Mark Allen (IBM)
- Matias Cabral (Intel)
- Matthew Dosanjh (Sandia)
- Michael Heinz (Cornelis Networks)
- Nathan Hjelm (Google)
- Naughton III, Thomas (ORNL)
- Noah Evans (Sandia)
- Raghu Raja (AWS)
- Ralph Castain (Intel)
- Sam Gutierrez (LLNL)
- Scott Breyer (Sandia?)
- Shintaro Iwasaki
- Todd Kordenbrock (Sandia)
- Tomislav Janjusic
- Xin Zhao (nVidia/Mellanox)
- Schedule: No schedule for v4.0.8 yet - sometime in 2022
- Bug fixes on a case-by-case basis
- Schedule: No schedule for v4.1.3 yet either - sometime in 2022
- Slowing down.
- 9756 - one outstanding PR.
- Austen PRed a bunch of commits on master not yet in v5.0
- Opened two more.
- 9643 - Issue for - needs PMIx and PRRTE updates
- Submodule pointers on v5.0 need updating
- Still pointing at something on PMIx v4.1.x.
- Brian PRing some fixes so we can update to PMIx v4.2
- Issue
- If there is an SPML, we build the OSHMEM interface.
- libNBC uninitialized variable. Jeff filed 9749 this morning (probably on both master and v5.0.x)
-
Community is warm/open to bringing in Sessions, but wants to see Howard's PR later this week
-
Clock Monotonic - Jeff updated Timers.md in ompi-www
- May only be Linux and OSX - maybe just an opal_inline, doesn't warrant a whole framework
- WTIME a long time ago said not using framework.
- Everyone just needs to agree to use one function
- just need ompi_wtime (very MPI specific), wouldn't put it into opal
- just going to call clock_gettime with CLOCK_MONOTONIC_RAW (doesn't allow for migrating to another core)
- Maybe we should unify the times.
- No requirement for MPI_Times to be comparable to Wtick and Wtime.
- Quirks on different platforms.
- Opal_Timers were really built for opal progress, where we needed a 10ms timer with low perturbation.
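The "one function everyone agrees on" idea from the timer discussion could look roughly like the following. This is a minimal sketch, assuming POSIX `clock_gettime`/`clock_getres`; the names `sketch_clock_id`, `sketch_wtime`, and `sketch_wtick` are hypothetical and are not Open MPI functions. It assumes a fallback to `CLOCK_MONOTONIC` where `CLOCK_MONOTONIC_RAW` is not defined, which the notes do not specify.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* expose clock_gettime() under strict C modes */
#endif
#include <time.h>

/* Hypothetical helper: pick the raw monotonic clock when the platform
 * defines it, otherwise fall back to the plain monotonic clock. */
static clockid_t sketch_clock_id(void)
{
#ifdef CLOCK_MONOTONIC_RAW
    return CLOCK_MONOTONIC_RAW;
#else
    return CLOCK_MONOTONIC;
#endif
}

/* Hypothetical wtime-style function: seconds since an arbitrary fixed
 * point in the past. Returns -1.0 if the clock cannot be read. */
double sketch_wtime(void)
{
    struct timespec ts;
    if (clock_gettime(sketch_clock_id(), &ts) != 0) {
        return -1.0;
    }
    return (double) ts.tv_sec + (double) ts.tv_nsec * 1e-9;
}

/* Hypothetical wtick-style function: reported clock resolution in
 * seconds. Returns -1.0 if the resolution cannot be queried. */
double sketch_wtick(void)
{
    struct timespec res;
    if (clock_getres(sketch_clock_id(), &res) != 0) {
        return -1.0;
    }
    return (double) res.tv_sec + (double) res.tv_nsec * 1e-9;
}
```

Keeping the clock choice in one helper means the time and tick functions always report on the same clock, which is one way to read the "maybe we should unify the times" comment.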
-
NUMA domain in BIOS - Didn't have a chance to test the newest Open MPI v5.
- Systems where you can change how the cores are distributed in BIOS
- Default binding: when you run more than two processes, they should bind to socket.
- Man pages are misleading, though they were right at the time.
- It binds to the NUMA domain (at the time this was a one-to-one mapping with a socket)
- Might be - lstopo output and hwloc output.
-
Cisco has some test build failures.
-
Intel systems that have the Level Zero API - ROMIO issue in compilation
- Issue 9715 - Only workaround is to disable building ROMIO (Lustre perf issue)
- To fix it right, we might need to upgrade to the ROMIO in MPICH v4.
- This package has been rewritten.
- No configury to disable the Intel GPU support in ROMIO. That would work around this issue.
- Is this a blocker for v5? Probably No? Perhaps Intel?
-
IBM has an OMPI build failure with XL compiler on ppc64le.
- We might need to
ompi_proc_sentinel_to_name(uintptr_t)$AF56_10. Compilation ended. Contact your Service
Representative and provide the following information: Internal abort. For more information visit:
http://www.ibm.com/support/docview.wss?uid=swg21110810
make[2]: *** [Makefile:2559: dpm/dpm.lo] Error 1
make[1]: *** [Makefile:2665: all-recursive] Error 1
make: *** [Makefile:1478: all-recursive] Error 1
- IBM's looking to work around it with an Open MPI code change.
- Should we be concerned with an API break from PMIx v4.x to v5.x?
- Not sure?
- ABI things were going to break, so he wanted to break API at the same time.
- Storage spaces for strings.
- He had them all fixed stride so compilers could optimize... but not sure why.
- Not sure how to solve striding problem with variable length strings.
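To make the striding question concrete, here is a purely illustrative sketch of the two layouts being contrasted. It is not PMIx's actual data layout; `SKETCH_STRIDE`, `sketch_fixed_get`, and `sketch_packed_get` are hypothetical names, and the offset table is just one assumed way of handling variable-length strings.

```c
#include <stddef.h>

#define SKETCH_STRIDE 16 /* hypothetical fixed slot size, in bytes */

/* Fixed stride: every string occupies a SKETCH_STRIDE-byte slot, so
 * entry i is found with plain arithmetic and no lookup table. */
const char *sketch_fixed_get(const char *base, size_t i)
{
    return base + i * SKETCH_STRIDE;
}

/* Variable length: strings are packed back to back, so locating entry i
 * needs a separate table of byte offsets into the packed blob. */
const char *sketch_packed_get(const char *blob, const size_t *offsets, size_t i)
{
    return blob + offsets[i];
}
```

The fixed-stride form wastes space on short strings and caps string length, while the packed form saves space at the cost of an extra indirection per lookup.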
- There was something that was brought up previously about module pointers being fixed for v5.0 for OMPI.
- Is the long term we'll always
- Probably converging, but a few hiccups
-
No discussion 12/14/2021
-
OMPI docs and manpages, but persistent problem that mpirun is really prrterun
- PMIx and PRRTE now use pandoc. It'd be bad to require both pandoc and Sphinx
- Josh Hursey wrote this up in https://github.com/openpmix/prrte/issues/931, as a means to work out how to do man mpirun for Open MPI
-
PR 8329 - convert README, HACKING, and possibly manpages to reStructuredText.
- Uses https://www.sphinx-doc.org/en/master/ (Python tool, can pip install)
- Intent is for this to be in v5.0
- mpirun / prrterun - we had quite a bit of detail in ORTE, but are updating as much as possible.
- Ralph has asked about this for PMIx/PRRTE since this is turning out to work
-
No update - 3/16
- Could be independent of PMIx and PRRTE.
- PMIx and PRRTE want to follow suit, and not require both pandoc and Sphinx.