(3.9.0‐latest) SSH bootstrap cannot launch processes on remote host when using Intel MPI with Slurm 23.11
In ParallelCluster 3.9.0, Slurm has been upgraded to 23.11.4 (from 23.02.7).
Slurm by default supports mpirun from Intel MPI and allows the use of different [I_MPI_HYDRA_BOOTSTRAP](https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/2021-9/hydra-environment-variables.html) mechanisms.
Slurm 23.11 changed the behaviour of mpirun when using I_MPI_HYDRA_BOOTSTRAP=slurm (the default): it injects two environment variables and passes the --external-launcher option to the launcher command.
The documentation explains that it is possible to use a different bootstrap mechanism by explicitly setting the I_MPI_HYDRA_BOOTSTRAP environment variable prior to submitting the job with sbatch or salloc.
This means that if the application (e.g. Ansys Fluent) or the job submission script uses a different bootstrap launcher without setting the I_MPI_HYDRA_BOOTSTRAP variable, the job submission will fail with the following message:
[ec2-user@ip-10-0-0-193 ~]$ mpirun -launcher=ssh -np 1 hostname
[mpiexec@ip-10-0-0-193] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on queue1-dy-t2-1 (pid 8914, exit code 65280)
[mpiexec@ip-10-0-0-193] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@ip-10-0-0-193] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@ip-10-0-0-193] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1065): error waiting for event
[mpiexec@ip-10-0-0-193] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1027): error setting up the bootstrap proxies
[mpiexec@ip-10-0-0-193] Possible reasons:
[mpiexec@ip-10-0-0-193] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@ip-10-0-0-193] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@ip-10-0-0-193] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@ip-10-0-0-193] 4. Ssh bootstrap cannot launch processes on remote host. Make sure that passwordless ssh connection is established across compute hosts.
[mpiexec@ip-10-0-0-193] You may try using -bootstrap option to select alternative launcher.
-launcher=ssh corresponds to the undocumented -rsh=ssh flag; with either of them you will receive the same error.
Affected versions:
- ParallelCluster >= 3.9.0
- Slurm >= 23.11
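To check whether a cluster is affected, you can query the Slurm version from the head node and the ParallelCluster CLI version from the machine where the pcluster CLI is installed (note the CLI version may differ from the version the cluster was actually created with):
sinfo --version      # prints the Slurm version, e.g. 23.11.x
pcluster version     # prints the ParallelCluster CLI version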
The solution, as stated in the documentation, is to set the I_MPI_HYDRA_BOOTSTRAP environment variable prior to submitting the job with sbatch or salloc. Example with an interactive salloc allocation (a batch sbatch variant is sketched after this example):
[ec2-user@ip-10-0-0-193 ~]$ export I_MPI_HYDRA_BOOTSTRAP=ssh
[ec2-user@ip-10-0-0-193 ~]$ salloc -n1
salloc: Granted job allocation 5
[ec2-user@ip-10-0-0-193 ~]$ module load intelmpi
Loading intelmpi version 2021.9.0
[ec2-user@ip-10-0-0-193 ~]$ env | grep MPI
OMPI_MCA_plm_slurm_args=--external-launcher
I_MPI_HYDRA_BOOTSTRAP=ssh
I_MPI_ROOT=/opt/intel/mpi/2021.9.0
[ec2-user@ip-10-0-0-193 ~]$ mpirun -np 1 hostname
queue1-dy-t2-1
[ec2-user@ip-10-0-0-193 ~]$ mpirun -rsh=ssh -np 1 hostname
queue1-dy-t2-1
[ec2-user@ip-10-0-0-193 ~]$ mpirun -launcher=ssh -np 1 hostname
queue1-dy-t2-1
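The same workaround applies to batch jobs: export the variable in the submitting shell so that sbatch propagates it into the job environment. Below is a minimal sketch (not an actual transcript), where job.sh is a hypothetical submission script that requests the ssh launcher:
#!/bin/bash
#SBATCH -n 1
# job.sh: load Intel MPI as in the interactive example, then run with the ssh launcher
module load intelmpi
mpirun -launcher=ssh -np 1 hostname
Submit it with the documented variable already exported:
export I_MPI_HYDRA_BOOTSTRAP=ssh
sbatch job.sh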
When using Slurm 23.11, I_MPI_HYDRA_BOOTSTRAP=slurm is the default bootstrap mechanism, which is why the --external-launcher parameter is added:
[ec2-user@ip-10-0-0-193 ~]$ salloc -n1
salloc: Granted job allocation 4
[ec2-user@ip-10-0-0-193 ~]$ env | grep MPI
I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS=--external-launcher
OMPI_MCA_plm_slurm_args=--external-launcher
I_MPI_HYDRA_BOOTSTRAP=slurm
When submitting a job with the default bootstrap (slurm), the submission works as expected:
[ec2-user@ip-10-0-0-193 ~]$ module load intelmpi
Loading intelmpi version 2021.9.0
[ec2-user@ip-10-0-0-193 ~]$ mpirun -np 1 hostname
queue1-dy-t2-1
If the application launches mpirun with the -rsh=ssh or -launcher=ssh flag, it is asking for the bootstrap launcher to be ssh rather than slurm. If the application does not set the I_MPI_HYDRA_BOOTSTRAP variable, it will fail with an error like the following:
[ec2-user@ip-10-0-0-193 ~]$ mpirun -launcher=ssh -np 1 hostname
[mpiexec@ip-10-0-0-193] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on queue1-dy-t2-1 (pid 8914, exit code 65280)
[mpiexec@ip-10-0-0-193] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@ip-10-0-0-193] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@ip-10-0-0-193] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1065): error waiting for event
[mpiexec@ip-10-0-0-193] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1027): error setting up the bootstrap proxies
[mpiexec@ip-10-0-0-193] Possible reasons:
[mpiexec@ip-10-0-0-193] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@ip-10-0-0-193] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@ip-10-0-0-193] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@ip-10-0-0-193] 4. Ssh bootstrap cannot launch processes on remote host. Make sure that passwordless ssh connection is established across compute hosts.
[mpiexec@ip-10-0-0-193] You may try using -bootstrap option to select alternative launcher.
[ec2-user@ip-10-0-0-193 ~]$ mpirun -rsh=ssh -np 1 hostname
[mpiexec@ip-10-0-0-193] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on queue1-dy-t2-1 (pid 7653, exit code 65280)
[mpiexec@ip-10-0-0-193] poll_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@ip-10-0-0-193] HYD_dmx_poll_wait_for_proxy_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@ip-10-0-0-193] HYD_bstrap_setup (../../../../../src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1065): error waiting for event
[mpiexec@ip-10-0-0-193] HYD_print_bstrap_setup_error_message (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1027): error setting up the bootstrap proxies
[mpiexec@ip-10-0-0-193] Possible reasons:
[mpiexec@ip-10-0-0-193] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@ip-10-0-0-193] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@ip-10-0-0-193] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@ip-10-0-0-193] 4. Ssh bootstrap cannot launch processes on remote host. Make sure that passwordless ssh connection is established across compute hosts.
[mpiexec@ip-10-0-0-193] You may try using -bootstrap option to select alternative launcher.
To fix the issue, export the documented environment variable before submitting the job, and everything will work as expected:
[ec2-user@ip-10-0-0-193 ~]$ export I_MPI_HYDRA_BOOTSTRAP=ssh
[ec2-user@ip-10-0-0-193 ~]$ salloc -n1
salloc: Granted job allocation 5
[ec2-user@ip-10-0-0-193 ~]$ env | grep MPI
OMPI_MCA_plm_slurm_args=--external-launcher
I_MPI_HYDRA_BOOTSTRAP=ssh
I_MPI_ROOT=/opt/intel/mpi/2021.9.0
[ec2-user@ip-10-0-0-193 ~]$ module load intelmpi
Loading intelmpi version 2021.9.0
[ec2-user@ip-10-0-0-193 ~]$ mpirun -np 1 hostname
queue1-dy-t2-1
[ec2-user@ip-10-0-0-193 ~]$ mpirun -rsh=ssh -np 1 hostname
queue1-dy-t2-1
[ec2-user@ip-10-0-0-193 ~]$ mpirun -launcher=ssh -np 1 hostname
queue1-dy-t2-1
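Note that, as the env outputs above suggest, the injection of I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS=--external-launcher happens at allocation time, so per the documentation the variable has to be set before sbatch or salloc is invoked; exporting it only inside the job script would likely come too late. For an application such as Ansys Fluent that internally invokes mpirun with the ssh launcher, the export therefore belongs in the shell (or wrapper) that submits the job. A minimal sketch, assuming a hypothetical submission script fluent_job.sh whose application internally calls mpirun -launcher=ssh:
export I_MPI_HYDRA_BOOTSTRAP=ssh   # set before submission, so Slurm does not inject I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS
sbatch fluent_job.sh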