
MPI parallel code not running via slurm on BM, but running via login #2

jazzquezz opened this issue Aug 7, 2019 · 0 comments

@jazzquezz

Hi guys,
I have a problem as a user of Oracle Cloud Infrastructure; let's see if anyone can help.
I have a binary compiled on the login node, a parallel code that uses MPI heavily. I have a Slurm script that submits the job, loading some modules first. What is strange is that if I sbatch the script to a BM instance that is already up and running, I get an error at MPI init, i.e. at the very beginning. If I do the same on a VM, everything works fine. Everything also works fine if I log in directly to the BM, load the same modules, and run the binary with "mpirun -np ...".

It seems there is a problem with MPI through Slurm on the BM... any hint?
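For reference, a minimal test along these lines (just a sketch, assuming mpicc from the loaded openmpi3 module is on the PATH) should show whether MPI_Init already fails through srun for a trivial program, independent of my binary:

cat > mpi_hello.c << 'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    /* If the PMIx handshake with slurmstepd is broken, it fails right here */
    MPI_Init(&argc, &argv);
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    printf("rank %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}
EOF
mpicc mpi_hello.c -o mpi_hello
# Launch it the same way as the real job; -w <BM node name> would target the bare metal node
srun --mpi=pmix -N 1 -n 4 ./mpi_hello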

I attach the full Slurm script here.

thanks!


#!/bin/bash
#SBATCH --job-name="combo"
#SBATCH --time=02:00:00
#SBATCH --ntasks=64
#SBATCH --threads-per-core=1
#SBATCH --output=/mnt/shared/ELEM/data/scars-darrel-test/slurm_outputs/output_%J.out
#SBATCH --error=/mnt/shared/ELEM/data/scars-darrel-test/slurm_outputs/output_%J.err

#### I use this only to test the case where the BM is started beforehand

#SBATCH --nodelist=bm-standard-e2-64-ad1-0003

module purge
module load hwloc
module load pmix
module load prun/1.3
module load gnu8/8.3.0
module load openmpi3/3.1.4
module load ohpc
module load Python/3.6.6-foss-2018b

set -eo pipefail -o nounset
source /etc/profile.d/lmod.sh

export folderdata=/mnt/shared/ELEM/data/scars-darrel-test

export foldertemplate=${folderdata}/data_in
export folderin=${folderdata}/data_in_${SLURM_JOB_ID}
export foldergeom=${folderdata}/geom_in
export folderout=${folderdata}/resu_${SLURM_JOB_ID}
export foldervtkgeom=${folderdata}/vtk-geom-definition

export probname=wedge_scars

export binalya=/mnt/shared/ELEM/bm-standard-e2-64-ad1-0001-cosas/mariano-exmedi-ohara-alya2/Executables/unix/Alya.g

mkdir -p ${folderin}
cp -r ${foldertemplate}/*  ${folderin}/.

echo '--|JOB STARTING AT: ' `date`
echo '--|   ALYA: STARTING AT: ' `date`
cd ${folderin}

#### I get the error after this:

time -p srun --mpi=pmix ${binalya} ${probname}

echo '--|   ALYA: FINISHED AT: ' `date`
echo '--|JOB FINISHED: ' `date`
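For what it's worth, these are the kinds of checks I can run on the cluster to see how Slurm and Open MPI are set up (just a sketch of standard commands, nothing specific to my setup):

# Which MPI/PMI plugin types srun offers, and what the cluster default is
srun --mpi=list
scontrol show config | grep -i mpi

# Whether this Open MPI 3.1.4 build includes PMIx support
ompi_info | grep -i pmix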
milliams transferred this issue from clusterinthecloud/installer on Jun 25, 2020