
MPI parallel code not running via slurm on BM, but running via login #2

jazzquezz opened this issue Aug 7, 2019 · 0 comments

@jazzquezz

Hi guys,
I have a problem as a user of Oracle Cloud Infrastructure; let's see if anyone can help.
I have a binary compiled on the login node, a parallel code that uses MPI heavily. I have a Slurm script that submits the job, loading some modules first. What is strange is that if I sbatch the script to a BM instance that is already up and running, I get an error at MPI init, i.e. at the very beginning. If I do the same on a VM, everything works fine. Everything also works fine if I log in directly to the BM, load the same modules, and run the binary with "mpirun -np ...".

It seems there is a problem with MPI through Slurm on the BM... any hint?
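For reference, a minimal test along these lines (just a sketch, assuming mpicc from the loaded openmpi3 module is on the PATH) should show whether MPI_Init already fails through srun for a trivial program, independent of my binary:

cat > mpi_hello.c << 'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    /* If the PMIx handshake with slurmstepd is broken, it fails right here */
    MPI_Init(&argc, &argv);
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    printf("rank %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}
EOF
mpicc mpi_hello.c -o mpi_hello
# Launch it the same way as the real job; -w <BM node name> would target the bare metal node
srun --mpi=pmix -N 1 -n 4 ./mpi_hello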

I attach the full Slurm script here.

thanks!


#!/bin/bash
#SBATCH --job-name="combo"
#SBATCH --time=02:00:00
#SBATCH --ntasks=64
#SBATCH --threads-per-core=1
#SBATCH --output=/mnt/shared/ELEM/data/scars-darrel-test/slurm_outputs/output_%J.out
#SBATCH --error=/mnt/shared/ELEM/data/scars-darrel-test/slurm_outputs/output_%J.err

#### I use this only to test the case where the BM is started beforehand

#SBATCH --nodelist=bm-standard-e2-64-ad1-0003

module purge
module load hwloc
module load pmix
module load prun/1.3
module load gnu8/8.3.0
module load openmpi3/3.1.4
module load ohpc
module load Python/3.6.6-foss-2018b

set -eo pipefail -o nounset
source /etc/profile.d/lmod.sh

export folderdata=/mnt/shared/ELEM/data/scars-darrel-test

export foldertemplate=${folderdata}/data_in
export folderin=${folderdata}/data_in_${SLURM_JOB_ID}
export foldergeom=${folderdata}/geom_in
export folderout=${folderdata}/resu_${SLURM_JOB_ID}
export foldervtkgeom=${folderdata}/vtk-geom-definition

export probname=wedge_scars

export binalya=/mnt/shared/ELEM/bm-standard-e2-64-ad1-0001-cosas/mariano-exmedi-ohara-alya2/Executables/unix/Alya.g

mkdir -p ${folderin}
cp -r ${foldertemplate}/*  ${folderin}/.

echo '--|JOB STARTING AT: ' `date`
echo '--|   ALYA: STARTING AT: ' `date`
cd ${folderin}

#### I get the error after this:

time -p srun --mpi=pmix ${binalya} ${probname}

echo '--|   ALYA: FINISHED AT: ' `date`
echo '--|JOB FINISHED: ' `date`
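For what it's worth, these are the kinds of checks I can run on the cluster to see how Slurm and Open MPI are set up (just a sketch of standard commands, nothing specific to my setup):

# Which MPI/PMI plugin types srun offers, and what the cluster default is
srun --mpi=list
scontrol show config | grep -i mpi

# Whether this Open MPI 3.1.4 build includes PMIx support
ompi_info | grep -i pmix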
milliams transferred this issue from clusterinthecloud/installer on Jun 25, 2020