
Using MPI and OpenMP

Domain Decomposition: MeshBlock

For parallel simulations with MPI, the computational domain is decomposed into small units. In Athena++, this decomposition unit is called a MeshBlock, and all MeshBlocks have the same logical size (i.e., the same number of cells). The MeshBlocks are stored in a tree structure, and each has a unique integer ID assigned by Z-ordering.

The MeshBlock size is specified by the <meshblock> parameters in an input file. The following example decomposes a Mesh with 256^3 cells into MeshBlocks with 64^3 cells each, resulting in (256/64)^3 = 64 MeshBlocks. Obviously, the Mesh size must be divisible by the MeshBlock size in each dimension.

<mesh>
nx1     =    256
...
nx2     =    256
...
nx3     =    256
...
<meshblock>
nx1     =    64
nx2     =    64
nx3     =    64

For non-parallelized output formats (e.g., VTK), one file is generated per MeshBlock regardless of the actual number of processes. We recommend the HDF5 output because it combines all the MeshBlocks and produces only two files per output timestep. For details, see Outputs.
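
As a minimal sketch of an HDF5 output block (the block name <output1>, the variable choice, and the dt value below are illustrative placeholders; see the Outputs page for the full list of parameters):

<output1>
file_type  = hdf5     # combined HDF5 output
variable   = prim     # primitive variables
dt         = 0.1      # simulation time between outputs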

MPI Parallelization

Athena++ is parallelized using the standard Message Passing Interface (MPI-2). Each MPI process owns one or more MeshBlocks. The number of MeshBlocks per process may differ, but of course the best load balance is achieved when the computational load is distributed evenly. For MHD, the computational cost of a process is proportional to the number of MeshBlocks it owns, but Athena++ supports weighting based on actual computational costs when additional physics causes load imbalance (not available in the current release). To check the load balance, use the -m [nproc] option before running the simulation:

> athena -i athinput.example -m 64

This can be done with a single process. It tells you how many MeshBlocks are assigned to each process (see also Static Mesh Refinement).

In the previous example, up to 64 processes can be launched. To start a simulation, simply launch the code using mpiexec, mpirun, or a similar command:

> mpiexec -n 64 athena -i athinput.example

Please consult the documentation of your system for details.
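
Note that the code must be configured with MPI support before building. A rough sketch of the full workflow (the problem generator name and the -hdf5 flag here are illustrative; the exact configure options may differ between versions, so check the configuration documentation):

> python configure.py --prob=blast -mpi -hdf5
> make clean
> make
> mpiexec -n 64 athena -i athinput.example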

OpenMP Parallelization

OpenMP is a standard for shared-memory parallelization within a node. OpenMP parallelizes calculations within each MeshBlock. To enable it, configure the code with the -omp option and set num_threads in the <mesh> block of your input file. You probably also need to set an environment variable to specify the number of threads; generally this is OMP_NUM_THREADS, but please check the documentation of your system.
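
For example, a minimal OpenMP setup with four threads might look like the following (the value 4 is arbitrary, and the export syntax assumes a bash-like shell):

<mesh>
num_threads = 4       # number of OpenMP threads per process

> export OMP_NUM_THREADS=4
> athena -i athinput.example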

OpenMP parallelization is not very scalable; you will usually get the best performance with 2 or 4 threads per process. Because threads within a process can share some data, especially the MeshBlock tree, using OpenMP saves some memory. This is helpful when you are running very large parallel simulations.
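
As an illustration, a hybrid MPI+OpenMP run of the 64-MeshBlock example above could use 32 processes with 2 threads each (just one possible combination; the code must be configured with both -mpi and -omp):

<mesh>
num_threads = 2

> export OMP_NUM_THREADS=2
> mpiexec -n 32 athena -i athinput.example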

Note on Performance

Generally speaking, larger MeshBlocks are better for performance, but it is a matter of balance between performance and time to solution. On a Haswell Xeon E5-2690v3 with flat MPI parallelization using 24 processes per node, Athena++ achieves about 7x10^5 cells per second per process for MHD (almost twice that for hydrodynamics), and its weak scaling is almost perfect when 64^3 cells per process are used. In other words, one timestep takes less than 0.4 seconds in this situation (64^3 ≈ 2.6x10^5 cells divided by 7x10^5 cells per second ≈ 0.37 s). If the performance you measure is significantly lower than these values, something is wrong.

Generally, we recommend using MeshBlocks with at least 32^3 cells, and preferably 64^3 cells, per process (or thread). For Adaptive Mesh Refinement, smaller MeshBlocks such as 16^3 are useful because they allow more flexibility, but if they are too small the overhead becomes significant. These numbers depend on the computer, however, so we recommend testing the performance of your system before starting production runs.
