
Add ABI support for HPE's MPT and HMPT implementations #580

Merged · 7 commits into JuliaParallel:master · May 5, 2022

Conversation

@sloede (Member) commented Apr 26, 2022

The MPT ABI is based on mpi.h and mpio.h for MPT v2.23.

cc @JBlaschke
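
For context, MPIPreferences identifies the implementation from the library's version string (the version_string = "HPE MPT 2.23 ..." shown in the use_system_binary output further down). A minimal sketch of that query, assuming an MPT libmpi.so is findable via LD_LIBRARY_PATH; the buffer size below is an oversized guess, not MPT's actual MPI_MAX_LIBRARY_VERSION_STRING:

buf = Vector{UInt8}(undef, 8192)  # oversized scratch buffer for the version string
len = Ref{Cint}(0)
# MPI_Get_library_version may be called before MPI_Init; it fills `buf` with the
# string that later shows up as version_string = "HPE MPT 2.23  08/26/20 ...".
ccall((:MPI_Get_library_version, "libmpi"), Cint, (Ptr{UInt8}, Ptr{Cint}), buf, len)
println(String(buf[1:len[]]))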

@vchuravy requested a review from simonbyrne on April 26, 2022 at 19:01
@vchuravy (Member) commented:

I suspect there is no chance that we can get a copy for CI/Yggdrasil?

@giordano (Member) commented:

It'd be nice to list the new ABIs in https://juliaparallel.org/MPI.jl/dev/configuration/#Configuration-2 🙂

@sloede (Member, Author) commented Apr 26, 2022

> I suspect there is no chance that we can get a copy for CI/Yggdrasil?

It's all closed source, and if you dig through the docs, license servers etc. are mentioned everywhere. Thus I think the answer is "no" 😞

@sloede (Member, Author) commented Apr 26, 2022

> It'd be nice to list the new ABIs in https://juliaparallel.org/MPI.jl/dev/configuration/#Configuration-2 🙂

Thanks for the hint! Fixed in a36dc35.

@sloede (Member, Author) commented Apr 26, 2022

The CI errors seem to be unrelated to the changes in the PR (e.g., this). At least some of the errors seem to be identical to the ones reported in #555.

(Review thread on the ABI-detection code:)

  # 3) determine the abi from the implementation + version
  if (impl == "MPICH" && version >= v"3.1" ||
      impl == "IntelMPI" && version > v"2014" ||
      impl == "MVAPICH" && version >= v"2" ||
-     impl == "CrayMPICH" && version >= v"7")
+     impl == "CrayMPICH" && version >= v"7" ||
      # https://www.mpich.org/abi/

Member (review comment):

The comment should go at the line below?

Member Author:

Not sure. The website specifically lists which MPI libraries, starting from which version, are ABI-compatible with MPICH. HMPT is not listed on that website; however, I know from the docs that it is supposed to be compatible. Thus I tried to set it apart by putting it below the existing comment, but I can also move it up again. What do you think?
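
For orientation only, a hypothetical sketch (not the PR's actual code) of how the branch discussed above could look, with HMPT folded into the MPICH-compatible group below the comment and MPT proper given its own ABI; the impl strings are assumptions based on this thread:

# Hypothetical sketch of the ABI selection; `impl` and `version` are assumed to
# come from parsing the MPI_Get_library_version string.
function select_abi(impl::AbstractString, version::VersionNumber)
    if (impl == "MPICH" && version >= v"3.1" ||
        impl == "IntelMPI" && version > v"2014" ||
        impl == "MVAPICH" && version >= v"2" ||
        impl == "CrayMPICH" && version >= v"7" ||
        # https://www.mpich.org/abi/
        impl == "HPE HMPT")  # HMPT is MPICH-ABI-compatible per the HPE docs, but not listed on that site
        return "MPICH"
    elseif impl == "HPE MPT"  # MPT proper ships its own ABI
        return "HPE MPT"
    else
        return "unknown"
    end
end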

@vchuravy (Member) commented:

Just a word of warning, you will likely run into JuliaPackaging/JLLWrappers.jl#40

Try adding LAMMPS.jl or LAMMPS_jll

@sloede (Member, Author) commented Apr 27, 2022

OK, so I am still running into issues I cannot explain:

When I try to run the simplest possible MPI program with this branch, mpirun -n 2 julia -e 'using MPI; MPI.Init()', I get the following error:

MPT ERROR: In xpmem_attach: Rank 0 could not attach 140444502511616 static bytes
, rc = -1, errno = 12
MPT ERROR: In xpmem_attach: Rank 1 could not attach 140444502511616 static bytes
, rc = -1, errno = 12
MPT ERROR: Rank 1(g:1) is aborting with error code 0.
	Process ID: 4119198, Host: hawk-login03, Program: /zhome/academic/HLRS/hlrs/hpcschlo/.pool/julia/1.7.2/bin/julia
	MPT Version: HPE MPT 2.23  08/26/20 02:54:49-root

MPT: --------stack traceback-------
MPT ERROR: MPI_COMM_WORLD rank 0 has terminated without calling MPI_Finalize()
	aborting job

I have not seen this error before with the registered versions of MPI.jl.

To reproduce, here's what I did:

  • Run
    export JULIA_DEPOT_PATH="$HOME/.julia-mpt-master"
    module load mpt
    export MPI_SHEPHERD=true
  • Clone this PR:
    git clone [email protected]:sloede/MPI.jl.git -b msl/add-mpt-abi
  • Dev MPIPreferences.jl and MPI.jl from this PR:
    julia -e 'using Pkg; Pkg.develop(path="MPI.jl/lib/MPIPreferences"); Pkg.develop(path="MPI.jl")'
  • Switch to system binary with julia -e 'using MPI; MPI.use_system_binary()', which should give you an output similar to this:
    ┌ Info: MPI implementation
    │   libmpi = "libmpi"
    │   version_string = "HPE MPT 2.23  08/26/20 02:54:49-root"
    │   impl = "HPE MPT"
    │   version = v"2.23.0"
    └   abi = "HPE MPT"
    ┌ Warning: The underlying MPI implementation has changed. You will need to restart Julia for this change to take effect
    │   libmpi = "libmpi"
    │   abi = "HPE MPT"
    │   mpiexec = "mpiexec"
    └ @ MPIPreferences ~/hackathon/msl-MPI.jl/lib/MPIPreferences/src/MPIPreferences.jl:122
  • Finally, trigger error with mpirun -n 2 julia -e 'using MPI; MPI.Init()'

Has anyone seen something like this before with the current master? As stated before, I have not had any issues running MPI via MPI.jl v0.19.x.

@JBlaschke Could you try to reproduce this on Perlmutter?

@eschnett (Contributor) commented:

@sloede That buffer size 140444502511616 looks like a pointer (0x00007fbbc8aed000). Maybe your Julia declarations get confused between the value of and pointer to a constant?
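
As a quick sanity check of that observation: the decimal byte count in the error message does decode exactly to the quoted address.

julia> string(140444502511616, base = 16)
"7fbbc8aed000"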

@sloede (Member, Author) commented Apr 28, 2022

> @sloede That buffer size 140444502511616 looks like a pointer (0x00007fbbc8aed000). Maybe your Julia declarations get confused between the value of and pointer to a constant?

That's a good point. I could create an MWE by running

MPI_SHEPHERD=true \
mpirun -n 1 julia -e '
required = Cint(2);
provided = Ref{Cint}();
ccall((:MPI_Init_thread, :libmpi), Cint, (Ptr{Cint},Ptr{Cvoid}, Cint, Ref{Cint}), C_NULL, C_NULL, required, provided)'

Note that the correct libmpi.so is on the LD_LIBRARY_PATH, which is why I assume that using :libmpi should work. Running the above yields the following error:

MPT ERROR: In xpmem_attach: Rank 0 could not attach 140700025802752 static bytes
, rc = -1, errno = 12
MPT ERROR: Rank 0(g:0) is aborting with error code 0.
	Process ID: 683280, Host: hawk-login04, Program: /zhome/academic/HLRS/hlrs/hpcschlo/.pool/julia/1.7.2/bin/julia
	MPT Version: HPE MPT 2.23  08/26/20 02:54:49-root

MPT: --------stack traceback-------
MPT: Attaching to program: /proc/683280/exe, process 683280
MPT: [New LWP 683281]
MPT: [New LWP 683282]
MPT: BFD: warning: /zhome/academic/HLRS/hlrs/hpcschlo/.pool/julia/1.7.2/bin/../lib/julia/libgcc_s.so.1: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010001
MPT: BFD: warning: /zhome/academic/HLRS/hlrs/hpcschlo/.pool/julia/1.7.2/bin/../lib/julia/libgcc_s.so.1: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010002
MPT: BFD: warning: /zhome/academic/HLRS/hlrs/hpcschlo/.pool/julia/1.7.2/bin/../lib/julia/libgfortran.so.5: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010001
MPT: BFD: warning: /zhome/academic/HLRS/hlrs/hpcschlo/.pool/julia/1.7.2/bin/../lib/julia/libgfortran.so.5: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010002
MPT: BFD: warning: /zhome/academic/HLRS/hlrs/hpcschlo/.pool/julia/1.7.2/bin/../lib/julia/libquadmath.so.0: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010001
MPT: BFD: warning: /zhome/academic/HLRS/hlrs/hpcschlo/.pool/julia/1.7.2/bin/../lib/julia/libquadmath.so.0: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010002
MPT: [Thread debugging using libthread_db enabled]
MPT: Using host libthread_db library "/lib64/libthread_db.so.1".
MPT: 0x00007ff778e71ae2 in waitpid () from /lib64/libc.so.6
MPT: warning: File "/sw/hawk-rh8/hlrs/non-spack/compiler/gcc/9.2.0/lib64/libstdc++.so.6.0.27-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
MPT: To enable execution of this file add
MPT: 	add-auto-load-safe-path /sw/hawk-rh8/hlrs/non-spack/compiler/gcc/9.2.0/lib64/libstdc++.so.6.0.27-gdb.py
MPT: line to your configuration file "/zhome/academic/HLRS/hlrs/hpcschlo/.gdbinit".
MPT: To completely disable this security protection add
MPT: 	set auto-load safe-path /
MPT: line to your configuration file "/zhome/academic/HLRS/hlrs/hpcschlo/.gdbinit".
MPT: For more information about this security protection see the
MPT: "Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
MPT: 	info "(gdb)Auto-loading safe path"
MPT: Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-151.el8.x86_64 libibverbs-41mlnx1-OFED.4.9.3.0.0.49417.x86_64 libmlx4-41mlnx1-OFED.4.7.3.0.3.49417.x86_64 libmlx5-41mlnx1-OFED.4.9.0.1.2.49417.x86_64 libnl3-3.5.0-1.el8.x86_64 numactl-libs-2.0.12-11.el8.x86_64
MPT: (gdb) #0  0x00007ff778e71ae2 in waitpid () from /lib64/libc.so.6
MPT: #1  0x00007ff747319f56 in mpi_sgi_system (
MPT: #2  MPI_SGI_stacktraceback (
MPT:     header=header@entry=0x7ffc97f72db0 "MPT ERROR: Rank 0(g:0) is aborting with error code 0.\n\tProcess ID: 683280, Host: hawk-login04, Program: /zhome/academic/HLRS/hlrs/hpcschlo/.pool/julia/1.7.2/bin/julia\n\tMPT Version: HPE MPT 2.23  08/26"...) at sig.c:340
MPT: #3  0x00007ff747248e33 in print_traceback (ecode=ecode@entry=0) at abort.c:246
MPT: #4  0x00007ff747248ffe in MPI_SGI_abort () at abort.c:122
MPT: #5  0x00007ff7472d961e in error_chk (str=<optimized out>, x=-1) at memmap.c:106
MPT: #6  error_chk (str=<optimized out>, x=-1) at memmap.c:99
MPT: #7  attach_seg (identifier=identifier@entry=0x7ff7473a6519 "static",
MPT:     target_mem_addr=<optimized out>, len=140700025802752,
MPT:     gps=gps@entry=0x29637b0) at memmap.c:458
MPT: #8  0x00007ff7472da4d2 in do_attaches () at memmap.c:499
MPT: #9  MPI_SGI_memmap_conn_slave () at memmap.c:526
MPT: #10 0x00007ff74724cecb in slave_init (i=<optimized out>) at adi.c:303
MPT: #11 fork_slaves () at adi.c:763
MPT: #12 MPI_SGI_create_slaves () at adi.c:814
MPT: #13 0x00007ff74724d8ca in MPI_SGI_init () at adi.c:992
MPT: #14 0x00007ff74724dcef in MPI_SGI_init () at adi.c:848
MPT: #15 0x00007ff74724e0a7 in MPI_SGI_misc_init (required=required@entry=2,
MPT:     provided=provided@entry=0x7ff75c458220) at adi.c:1237
MPT: #16 0x00007ff7472cc4a3 in PMPI_Init_thread (argc=<optimized out>,
MPT:     argv=<optimized out>, required=2, provided=0x7ff75c458220)
MPT:     at init_thread.c:47
MPT: #17 0x00007ff75047a5e9 in ?? ()
MPT: #18 0x0000000000007a58 in ?? ()
MPT: #19 0x00007ff7530d9700 in jl_system_image_data ()
MPT:    from /zhome/academic/HLRS/hlrs/hpcschlo/.pool/julia/1.7.2/lib/julia/sys.so
MPT: #20 0x00007ff7533b2d80 in jl_system_image_data ()
MPT:    from /zhome/academic/HLRS/hlrs/hpcschlo/.pool/julia/1.7.2/lib/julia/sys.so
MPT: #21 0x00007ff75c458220 in ?? ()
MPT: #22 0x0000000000000008 in ?? ()
MPT: #23 0x00007ffc97f73bc0 in ?? ()
MPT: #24 0x00007ff75c458220 in ?? ()
MPT: #25 0x00007ff76025a090 in ?? ()
MPT: #26 0x0000000000000000 in ?? ()
MPT: (gdb) A debugging session is active.
MPT:
MPT: 	Inferior 1 [process 683280] will be detached.
MPT:
MPT: Quit anyway? (y or n) [answered Y; input not from terminal]
MPT: Detaching from program: /proc/683280/exe, process 683280
MPT: [Inferior 1 (process 683280) detached]

MPT: -----stack traceback ends-----
MPT ERROR: MPI_COMM_WORLD rank 0 has terminated without calling MPI_Finalize()
	aborting job
Killed

Unfortunately, I do not see or understand where my mistake could be. I understand even less why this fails now on the current master, while with v0.19.1 we did not seem to have an issue here. Prefixing LD_LIBRARY_PATH with the Julia lib dir did not help either.

@vchuravy Was anything changed in how the MPI library is loaded/prepared between the last registered version and the current master? That is, were there some "special ingredients" that got removed in master?

@sloede changed the title from "WIP: Add ABI support for HPE's MPT and HMPT implementations" to "Add ABI support for HPE's MPT and HMPT implementations" on May 2, 2022
@sloede marked this pull request as ready for review on May 2, 2022 at 20:18
@sloede (Member, Author) commented May 2, 2022

With all errors I have encountered so far now resolved (thanks to #592), this PR is ready for review from my side.

@JBlaschke have you had a chance to test this out on Perlmutter yet?

@simonbyrne merged commit 89875d4 into JuliaParallel:master on May 5, 2022