OpenMPI doesn't work when docker is running #1
Comments
It appears that this occurs because openmpi tries to use the virtual network interface that is set up for the docker container. This is the interface with IP 172.17.0.1 in the verbose log. It is not clear what we should do to avoid this. |
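For reference, the interface in question can be seen with a quick check on the host (a sketch; the bridge name and address depend on the Docker setup):
# List interfaces and their addresses; the Docker bridge typically shows up
# as docker0 with the 172.17.0.1 address seen in the verbose log.
ip -brief addr show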
To prevent OMPI from using a specific IP interface you can do |
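A sketch of the kind of per-run override meant here, using the btl_tcp_if_exclude MCA parameter (the interface list and the osu_bcast binary are placeholders taken from later in the thread):
# Exclude the Docker and libvirt bridges from the TCP BTL for a single run.
# Note: overriding the exclude list replaces the default, which also covers
# the loopback interface, so adding lo here may be advisable as well.
mpirun --mca btl_tcp_if_exclude docker0,virbr0 -np 2 ./osu_bcast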
Yes, but I'm assuming that you want openmpi to work without the users of our systems all having to know this and always run with this? |
You can set this in the Open MPI mca param file for the installation $OMPI_PREFIX/etc/openmpi-mca-params.conf.
btl_tcp_if_exclude=docker0,virbr0
Disadvantage is that this needs to be done for all installs and will not carry over to user-compiled Open MPI. This is also the advantage (having implicit things carry over to user installs can be confusing).
Aurelien
|
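As an alternative to the system-wide file above, the same parameter can be set per shell through Open MPI's environment-variable mechanism (a sketch, assuming an Open MPI that honors OMPI_MCA_* variables):
# Equivalent per-shell setting; applies to whichever Open MPI the user runs.
export OMPI_MCA_btl_tcp_if_exclude=docker0,virbr0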
Indeed, there is what I want and then there is what is possible. Is there a consistent way to identify the interfaces created by Docker, or interfaces that are virtual and cannot be used for data exchanges? Unfortunately the answer is no, and thus either the users/sysadmins provide the correct configuration files (either user or system-wide MCA params) or we will be reliant on the system timeout (btw, the execution did not deadlock; it is just waiting for the timeout to signal that the interface cannot be used, and the default timeout is extremely long). |
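For the per-user route mentioned above, Open MPI also reads a user-level parameter file (a sketch; same key as the system-wide file):
# ~/.openmpi/mca-params.conf
btl_tcp_if_exclude=docker0,virbr0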
Yes, disabling the docker0 interface avoids the problem. I will have to think about the best way to set this. It would not be very clean to set it manually within the Spack openmpi install directory, but it looks like Open MPI doesn't look anywhere else for the conf file. Also, I'm confused why openmpi isn't using vader/sm. Even if I set "--mca btl self,vader" it doesn't work correctly (doesn't run the osu_bcast):
|
All these output messages are from PMIX and not from OMPI. So based on these we cannot conclude whether vader/sm was used or not. Use |
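A typical way to see which PML and BTL components get selected (an assumption about the verbosity parameters being suggested here; the rank count and binary are placeholders):
# Raise component-selection verbosity for the PML and BTL frameworks.
mpirun --mca pml_base_verbose 10 --mca btl_base_verbose 10 -np 2 ./osu_bcast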
OK:
|
OB1 is selected, so all BTLs should be up and running, if you did not specifically exclude them (with |
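One way to check what a given installation includes or excludes by default (a sketch using ompi_info; the grep pattern is only illustrative):
# Show TCP BTL parameters, including the current if_include/if_exclude values.
ompi_info --param btl tcp --level 9 | grep if_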
We should investigate an upgrade of UCX to the latest and Open MPI to 5.0.2, which may have resolved these problems. |
I have scheduled a rebuild of the module that will be placed in a new location (date code 2024-03-01). |
I'm building a new software module set of the latest [email protected] and [email protected], but the changes in UCX are scheduled for 1.16. |
There is a problem with updating to openmpi@5 on our newer systems. The systems use [email protected] (required by slurm), but there is an incompatibility between this pmix version and openmpi version 5. It would be possible to use an "internal" pmix in openmpi, but I don't know if it will work with slurm then. Ideas? |
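If the build goes through Open MPI's own configure, the bundled PMIx can be selected explicitly (a sketch of the configure option only; whether srun's PMIx plugin then interoperates with it is exactly the open question here):
# Build Open MPI 5.x against its bundled PMIx instead of the system pmix.
./configure --prefix=$OMPI_PREFIX --with-pmix=internal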
Using openmpi's internal pmix, this is available to test on login.icl.utk.edu:
export MODULEPATH=/apps/spacks/2024-03-05/share/spack/modules/linux-rocky9-x86_64 |
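A usage sketch for picking up the test build (the module name is a placeholder; module avail shows the actual names produced by the rebuild):
# List and load the rebuilt Open MPI module from the new MODULEPATH.
module avail openmpi
module load openmpi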
Using that, Open MPI works as expected except for the following warning message
This can be resolved by installing the munge package (from the slurm installation RPMs; it doesn't get installed automatically in the client image when installing slurm, but it should). |
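On the client image that would look roughly like the following (package and service names are an assumption; munge ships with the Slurm installation RPMs referred to above):
# Install and start the munge authentication service on the client image;
# the munge key must match the one used by the Slurm controller.
dnf install -y munge
systemctl enable --now munge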
Problem: When a docker container is running, simple OpenMPI jobs cannot run using the tcp interface. For example, a broadcast test will hang.
Steps to reproduce:
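A minimal sketch of the scenario, assuming any running container and the osu_bcast test mentioned in the discussion (image name, rank count, and benchmark are placeholders):
# Start any container so the docker0 bridge is in use, then run a simple
# TCP broadcast test; with the bridge present the run hangs until the TCP
# connection timeout instead of completing.
docker run -d --name idle alpine sleep infinity
mpirun -np 2 --mca btl tcp,self ./osu_bcast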
Expected result:
Verbose output: