amrex.omp_threads: Can Avoid SMT #3607

Merged 7 commits on Nov 2, 2023
21 changes: 21 additions & 0 deletions Docs/sphinx_documentation/source/InputsComputeBackends.rst
@@ -0,0 +1,21 @@
.. _Chap:InputsComputeBackends:

Compute Backends
================

The following inputs must be preceded by ``amrex.`` and determine runtime options of CPU or GPU compute implementations.

+------------------------+-----------------------------------------------------------------------+-------------+------------+
| Parameter              | Description                                                           | Type        | Default    |
+========================+=======================================================================+=============+============+
| ``omp_threads``        | If OpenMP is enabled, this can be used to set the default number of   | String      | ``system`` |
|                        | threads. The special value ``nosmt`` can be used to avoid using       | or Int      |            |
|                        | threads for virtual cores (aka Hyperthreading or SMT), as is default  |             |            |
|                        | in OpenMP, and instead only spawns threads equal to the number of     |             |            |
|                        | physical cores in the system.                                         |             |            |
|                        | For the values ``system`` and ``nosmt``, the environment variable     |             |            |
|                        | ``OMP_NUM_THREADS`` takes precedence. For Integer values,             |             |            |
|                        | ``OMP_NUM_THREADS`` is ignored.                                       |             |            |
+------------------------+-----------------------------------------------------------------------+-------------+------------+

For GPU-specific parameters, see also the :ref:`GPU chapter <sec:gpu:parameters>`.
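
The precedence rules described in the table can be sketched as a small pure function. This is only an illustration under stated assumptions, not the AMReX implementation: ``resolve_omp_threads`` and its parameters are hypothetical names, the state of ``OMP_NUM_THREADS`` and the physical core count are passed in rather than queried, and an empty result means "leave the OpenMP runtime's own choice untouched".

```cpp
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical helper (not part of AMReX): returns the thread count to
// request explicitly, or std::nullopt to keep the OpenMP runtime's choice
// (which itself honors OMP_NUM_THREADS).
std::optional<int> resolve_omp_threads (std::string const& option,
                                        bool env_omp_num_threads_set,
                                        int physical_cores)
{
    if (option == "system") {
        return std::nullopt;          // runtime default or OMP_NUM_THREADS
    }
    if (option == "nosmt") {
        if (env_omp_num_threads_set) {
            return std::nullopt;      // OMP_NUM_THREADS takes precedence
        }
        return physical_cores;        // one thread per physical core
    }
    // Integer values: OMP_NUM_THREADS is ignored.
    try {
        std::size_t pos = 0;
        int const n = std::stoi(option, &pos);
        // Assumption of this sketch: reject trailing characters and
        // non-positive counts.
        if (pos == option.size() && n > 0) { return n; }
    } catch (...) { /* not a number */ }
    return std::nullopt;              // unknown value: keep the default
}
```

With these assumptions, ``resolve_omp_threads("nosmt", false, 8)`` requests 8 threads, while ``resolve_omp_threads("nosmt", true, 8)`` defers to the environment.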
1 change: 1 addition & 0 deletions Docs/sphinx_documentation/source/Inputs_Chapter.rst
@@ -9,6 +9,7 @@ Run-time Inputs
InputsProblemDefinition
InputsTimeStepping
InputsLoadBalancing
InputsComputeBackends
InputsPlotFiles
InputsCheckpoint

15 changes: 11 additions & 4 deletions Src/Base/AMReX.cpp
@@ -52,6 +52,7 @@
#endif

#ifdef AMREX_USE_OMP
#include <AMReX_OpenMP.H>
#include <omp.h>
#endif

@@ -72,7 +73,9 @@
#include <iostream>
#include <iomanip>
#include <new>
#include <optional>
#include <stack>
#include <string>
#include <thread>
#include <limits>
#include <vector>
@@ -459,15 +462,17 @@ amrex::Initialize (int& argc, char**& argv, bool build_parm_parse,
#endif

#ifdef AMREX_USE_OMP
amrex::OpenMP::init_threads();

// status output
if (system::verbose > 0) {
// static_assert(_OPENMP >= 201107, "OpenMP >= 3.1 is required.");
amrex::Print() << "OMP initialized with "
<< omp_get_max_threads()
<< " OMP threads\n";
}
-#endif

-#if defined(AMREX_USE_MPI) && defined(AMREX_USE_OMP)
+// warn if over-subscription is detected
if (system::verbose > 0) {
auto ncores = int(std::thread::hardware_concurrency());
if (ncores != 0 && // It might be zero according to the C++ standard.
@@ -476,8 +481,10 @@ amrex::Initialize (int& argc, char**& argv, bool build_parm_parse,
amrex::Print(amrex::ErrorStream())
<< "AMReX Warning: You might be oversubscribing CPU cores with OMP threads.\n"
<< " There are " << ncores << " cores per node.\n"
-<< " There are " << ParallelDescriptor::NProcsPerNode() << " MPI ranks per node.\n"
-<< " But OMP is initialized with " << omp_get_max_threads() << " threads per rank.\n"
+#if defined(AMREX_USE_MPI)
+<< " There are " << ParallelDescriptor::NProcsPerNode() << " MPI ranks (processes) per node.\n"
+#endif
+<< " But OMP is initialized with " << omp_get_max_threads() << " threads per process.\n"
<< " You should consider setting OMP_NUM_THREADS="
<< ncores/ParallelDescriptor::NProcsPerNode() << " or less in the environment.\n";
}
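
The over-subscription check in this hunk compares the logical core count with threads per process times processes per node. A minimal sketch of that arithmetic, detached from AMReX and OpenMP so it can be exercised with plain integers; `suggest_omp_threads` is a hypothetical name and the guard against a non-positive process count is an assumption of this sketch:

```cpp
#include <optional>

// Hypothetical helper: returns a suggested OMP_NUM_THREADS value when
// threads-per-process times processes-per-node exceeds the logical cores,
// and std::nullopt when no over-subscription is detected.
std::optional<int> suggest_omp_threads (int ncores, int omp_max_threads,
                                        int nprocs_per_node)
{
    // std::thread::hardware_concurrency() may report 0 (unknown).
    if (ncores == 0 || nprocs_per_node <= 0) { return std::nullopt; }
    if (ncores < omp_max_threads * nprocs_per_node) {
        return ncores / nprocs_per_node;   // integer division, as in the warning
    }
    return std::nullopt;
}
```

Because of the integer division, the suggestion can be 0 when there are more processes than cores; a caller might want to clamp it to 1.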
15 changes: 12 additions & 3 deletions Src/Base/AMReX_OpenMP.H
@@ -11,20 +11,29 @@ namespace amrex::OpenMP {
inline int get_max_threads () { return omp_get_max_threads(); }
inline int get_thread_num () { return omp_get_thread_num(); }
inline int in_parallel () { return omp_in_parallel(); }
inline void set_num_threads (int num) { omp_set_num_threads(num); }

void init_threads ();
}

-#else
+#else // AMREX_USE_OMP

namespace amrex::OpenMP {

constexpr int get_num_threads () { return 1; }
constexpr int get_max_threads () { return 1; }
constexpr int get_thread_num () { return 0; }
constexpr int in_parallel () { return false; }

constexpr void set_num_threads (int) { /* nothing */ }
constexpr void init_threads () { /* nothing */ }
}

-#endif
+#endif // AMREX_USE_OMP

namespace amrex {
/** ... */
int
numUniquePhysicalCores();
}

#endif
175 changes: 175 additions & 0 deletions Src/Base/AMReX_OpenMP.cpp
@@ -0,0 +1,175 @@
#include <AMReX_OpenMP.H>
#include <AMReX.H>
#include <AMReX_ParmParse.H>
#include <AMReX_Print.H>

#if defined(__APPLE__)
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

#if defined(_WIN32)
#include <windows.h>
#endif

#include <fstream>
#include <iostream>
#include <optional>
#include <set>
#include <sstream>
#include <string>
#include <thread>
#include <vector>


namespace amrex::OpenMP
{
int
numUniquePhysicalCores ()
{
int ncores;

#if defined(__APPLE__)
size_t len = sizeof(ncores);
// See hw.physicalcpu and hw.physicalcpu_max
// https://developer.apple.com/documentation/kernel/1387446-sysctlbyname/determining_system_capabilities/
// https://developer.apple.com/documentation/kernel/1387446-sysctlbyname
if (sysctlbyname("hw.physicalcpu", &ncores, &len, NULL, 0) == -1) {
if (system::verbose > 0) {
amrex::Print() << "numUniquePhysicalCores(): Error receiving hw.physicalcpu! "
<< "Defaulting to visible cores.\n";
}
ncores = int(std::thread::hardware_concurrency());
}
#elif defined(__linux__)
std::set<std::vector<int>> uniqueThreadSets;
int cpuIndex = 0;

while (true) {
Member

Would it be simpler to grep /proc/cpuinfo (using <regex>)?

$ cat /proc/cpuinfo | grep "processor"
processor       : 0
processor       : 1
processor       : 2
processor       : 3
processor       : 4
processor       : 5
processor       : 6
processor       : 7

$ cat /proc/cpuinfo | grep "core"
core id         : 0
cpu cores       : 4
core id         : 1
cpu cores       : 4
core id         : 2
cpu cores       : 4
core id         : 3
cpu cores       : 4
core id         : 0
cpu cores       : 4
core id         : 1
cpu cores       : 4
core id         : 2
cpu cores       : 4
core id         : 3
cpu cores       : 4

I have 4 physical cores and 8 threads.

Member Author

cpu cores could be the right one.

Here is a

model name	: 12th Gen Intel(R) Core(TM) i9-12900H
$ cat /proc/cpuinfo | grep "processor"
processor	: 0
processor	: 1
processor	: 2
processor	: 3
processor	: 4
processor	: 5
processor	: 6
processor	: 7
processor	: 8
processor	: 9
processor	: 10
processor	: 11
processor	: 12
processor	: 13
processor	: 14
processor	: 15
processor	: 16
processor	: 17
processor	: 18
processor	: 19
axel@axel-dell:~$ cat /proc/cpuinfo | grep "core"
core id		: 0
cpu cores	: 14
core id		: 0
cpu cores	: 14
core id		: 4
cpu cores	: 14
core id		: 4
cpu cores	: 14
core id		: 8
cpu cores	: 14
core id		: 8
cpu cores	: 14
core id		: 12
cpu cores	: 14
core id		: 12
cpu cores	: 14
core id		: 16
cpu cores	: 14
core id		: 16
cpu cores	: 14
core id		: 20
cpu cores	: 14
core id		: 20
cpu cores	: 14
core id		: 24
cpu cores	: 14
core id		: 25
cpu cores	: 14
core id		: 26
cpu cores	: 14
core id		: 27
cpu cores	: 14
core id		: 28
cpu cores	: 14
core id		: 29
cpu cores	: 14
core id		: 30
cpu cores	: 14
core id		: 31
cpu cores	: 14

Effectively 14 physical cores. The first 12 logical cores use 2x SMT (6 physical). The next 8 are 1x SMT.
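
For illustration, the `core id` idea discussed here could be sketched with `<regex>` roughly as follows. This is not what the PR implements: `count_cores_from_cpuinfo` is a hypothetical name, the text is read from any `std::istream` so the function can be tried on pasted output, and pairing `core id` with `physical id` (to keep sockets apart) is an assumption of this sketch. It returns 0 when no `core id` line is present, so a caller would still need a fallback.

```cpp
#include <istream>
#include <regex>
#include <set>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical sketch: count distinct (physical id, core id) pairs in
// /proc/cpuinfo-style text. Returns 0 if no "core id" line is found.
int count_cores_from_cpuinfo (std::istream& in)
{
    static const std::regex field(R"(^(physical id|core id)\s*:\s*(\d+)\s*$)");
    std::set<std::pair<int,int>> cores;
    int physical_id = 0;   // assumed 0 when the field is missing
    std::string line;
    std::smatch m;
    while (std::getline(in, line)) {
        if (!std::regex_match(line, m, field)) { continue; }
        int const value = std::stoi(m[2].str());
        if (m[1].str() == "physical id") {
            physical_id = value;              // applies to the next "core id"
        } else {
            cores.emplace(physical_id, value);
        }
    }
    return static_cast<int>(cores.size());
}
```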

Member Author

Hm, does not work on RHEL (Lassen, POWER9, altivec supported, 4x SMT):

$ cat /proc/cpuinfo | grep "processor"
processor	: 0
processor	: 1
processor	: 2
processor	: 3
processor	: 4
processor	: 5
processor	: 6
processor	: 7
processor	: 8
processor	: 9
processor	: 10
processor	: 11
processor	: 12
processor	: 13
processor	: 14
processor	: 15
processor	: 16
processor	: 17
processor	: 18
processor	: 19
processor	: 20
processor	: 21
processor	: 22
processor	: 23
processor	: 24
processor	: 25
processor	: 26
processor	: 27
processor	: 28
processor	: 29
processor	: 30
processor	: 31
processor	: 32
processor	: 33
processor	: 34
processor	: 35
processor	: 36
processor	: 37
processor	: 38
processor	: 39
processor	: 40
processor	: 41
processor	: 42
processor	: 43
processor	: 44
processor	: 45
processor	: 46
processor	: 47
processor	: 48
processor	: 49
processor	: 50
processor	: 51
processor	: 52
processor	: 53
processor	: 54
processor	: 55
processor	: 56
processor	: 57
processor	: 58
processor	: 59
processor	: 60
processor	: 61
processor	: 62
processor	: 63
processor	: 64
processor	: 65
processor	: 66
processor	: 67
processor	: 68
processor	: 69
processor	: 70
processor	: 71
processor	: 72
processor	: 73
processor	: 74
processor	: 75
processor	: 76
processor	: 77
processor	: 78
processor	: 79
processor	: 80
processor	: 81
processor	: 82
processor	: 83
processor	: 84
processor	: 85
processor	: 86
processor	: 87
processor	: 88
processor	: 89
processor	: 90
processor	: 91
processor	: 92
processor	: 93
processor	: 94
processor	: 95
processor	: 96
processor	: 97
processor	: 98
processor	: 99
processor	: 100
processor	: 101
processor	: 102
processor	: 103
processor	: 104
processor	: 105
processor	: 106
processor	: 107
processor	: 108
processor	: 109
processor	: 110
processor	: 111
processor	: 112
processor	: 113
processor	: 114
processor	: 115
processor	: 116
processor	: 117
processor	: 118
processor	: 119
processor	: 120
processor	: 121
processor	: 122
processor	: 123
processor	: 124
processor	: 125
processor	: 126
processor	: 127
[huebl1@lassen708:~]$ cat /proc/cpuinfo | grep "core"
<empty>

Member Author

Same issue for the Power9 on Summit.

Member Author (@ax3l, Oct 25, 2023)

The current implementation works with the Linux 4 and 5 kernels I tested.

I am not sure what defines the /proc/cpuinfo output (OS/distribution or kernel), but it looks very diverse for the systems I tested and does not have the info we need in at least the Linux 4 kernel-based systems I tested.

Member

I didn't realize /proc/cpuinfo is so different on different machines.

Member Author

I had no idea either, today I learned...
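
The loop that continues below reads `thread_siblings_list` from sysfs instead of `/proc/cpuinfo`. A self-contained sketch of its parsing and counting steps, detached from the filesystem so that it can be tried on literal strings; `parse_cpu_list` and `count_sibling_groups` are hypothetical names, and like the PR code it assumes each list is ascending and well-formed:

```cpp
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: expand a sysfs CPU list such as "0-3,8,10-11"
// into the individual logical CPU indices.
std::vector<int> parse_cpu_list (std::string const& list)
{
    std::vector<int> cpus;
    std::stringstream ss(list);
    std::string token;
    while (std::getline(ss, token, ',')) {
        auto const dash = token.find('-');
        if (dash != std::string::npos) {
            // Range "first-last", inclusive on both ends
            int const first = std::stoi(token.substr(0, dash));
            int const last  = std::stoi(token.substr(dash + 1));
            for (int i = first; i <= last; ++i) { cpus.push_back(i); }
        } else {
            cpus.push_back(std::stoi(token));
        }
    }
    return cpus;
}

// Logical CPUs that share a physical core report the same sibling list,
// so the number of distinct lists is taken as the number of physical cores.
int count_sibling_groups (std::vector<std::string> const& sibling_lists)
{
    std::set<std::vector<int>> groups;
    for (auto const& list : sibling_lists) {
        groups.insert(parse_cpu_list(list));
    }
    return static_cast<int>(groups.size());
}
```

Feeding it one string per logical CPU, for example `{"0-1", "0-1", "2-3", "2-3", "4", "5"}`, yields four groups: two SMT pairs and two single-threaded cores.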

// for each logical CPU in cpuIndex from 0...N-1
std::string path = "/sys/devices/system/cpu/cpu" + std::to_string(cpuIndex) + "/topology/thread_siblings_list";
std::ifstream file(path);
if (!file.is_open()) {
break; // no further CPUs to check
}

// find its siblings
std::vector<int> siblings;
std::string line;
if (std::getline(file, line)) {
std::stringstream ss(line);
std::string token;

// Possible syntax: 0-3, 8-11, 14,17
// https://github.com/torvalds/linux/blob/v6.5/Documentation/ABI/stable/sysfs-devices-system-cpu#L68-L72
while (std::getline(ss, token, ',')) {
size_t dashPos = token.find('-');
if (dashPos != std::string::npos) {
// Range detected
int start = std::stoi(token.substr(0, dashPos));
int end = std::stoi(token.substr(dashPos + 1));
for (int i = start; i <= end; ++i) {
siblings.push_back(i);
}
} else {
siblings.push_back(std::stoi(token));
}
}
}

// and record the siblings group
// (assumes: ascending and unique sets per cpuIndex)
uniqueThreadSets.insert(siblings);
cpuIndex++;
}

if (cpuIndex == 0) {
if (system::verbose > 0) {
amrex::Print() << "numUniquePhysicalCores(): Error reading CPU info.\n";
}
ncores = int(std::thread::hardware_concurrency());
} else {
ncores = int(uniqueThreadSets.size());
}
#elif defined(_WIN32)
DWORD length = 0;
bool result = GetLogicalProcessorInformation(NULL, &length);

if (!result) {
if (system::verbose > 0) {
amrex::Print() << "numUniquePhysicalCores(): Failed to get logical processor information! "
<< "Defaulting to visible cores.\n";
}
ncores = int(std::thread::hardware_concurrency());
}
else {
std::vector<SYSTEM_LOGICAL_PROCESSOR_INFORMATION> buffer(length / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
if (!GetLogicalProcessorInformation(&buffer[0], &length)) {
-std::cerr << "Failed to get logical processor information." << std::endl;
-return -1;
+if (system::verbose > 0) {
+amrex::Print() << "numUniquePhysicalCores(): Failed to get logical processor information! "
+<< "Defaulting to visible cores.\n";
+}
+ncores = int(std::thread::hardware_concurrency());
} else {
ncores = 0;
for (const auto& info : buffer) {
if (info.Relationship == RelationProcessorCore) {
ncores++;
}
}
}
}
#else
// TODO:
// BSD
if (system::verbose > 0) {
amrex::Print() << "numUniquePhysicalCores(): Unknown system. Defaulting to visible cores.\n";
}
ncores = int(std::thread::hardware_concurrency());
#endif
return ncores;
}

#ifdef AMREX_USE_OMP
void init_threads ()
{
amrex::ParmParse pp("amrex");
std::string omp_threads = "system";
pp.queryAdd("omp_threads", omp_threads);

auto to_int = [](std::string const & str_omp_threads) {
std::optional<int> num;
try { num = std::stoi(str_omp_threads); }
catch (...) { /* nothing */ }
return num;
};

if (omp_threads == "system") {
// default or OMP_NUM_THREADS environment variable
} else if (omp_threads == "nosmt") {
char const *env_omp_num_threads = std::getenv("OMP_NUM_THREADS");
if (env_omp_num_threads != nullptr && amrex::system::verbose > 1) {
Member Author

Ooopsi, bug fix in #3647

amrex::Print() << "amrex.omp_threads was set to nosmt,"
<< "but OMP_NUM_THREADS was set. Will keep "
<< "OMP_NUM_THREADS=" << env_omp_num_threads << ".\n";
} else {
omp_set_num_threads(numUniquePhysicalCores());
}
} else {
std::optional<int> num_omp_threads = to_int(omp_threads);
if (num_omp_threads.has_value()) {
omp_set_num_threads(num_omp_threads.value());
}
else {
if (amrex::system::verbose > 0) {
amrex::Print() << "amrex.omp_threads has an unknown value: "
<< omp_threads
<< " (try system, nosmt, or a positive integer)\n";
}
}
}
}
#endif // AMREX_USE_OMP
} // namespace amrex::OpenMP
1 change: 1 addition & 0 deletions Src/Base/CMakeLists.txt
@@ -53,6 +53,7 @@ foreach(D IN LISTS AMReX_SPACEDIM)
AMReX_ParallelDescriptor.H
AMReX_ParallelDescriptor.cpp
AMReX_OpenMP.H
AMReX_OpenMP.cpp
AMReX_ParallelReduce.H
AMReX_ForkJoin.H
AMReX_ForkJoin.cpp
2 changes: 1 addition & 1 deletion Src/Base/Make.package
@@ -37,7 +37,7 @@ C$(AMREX_BASE)_headers += AMReX_REAL.H AMReX_INT.H AMReX_CONSTANTS.H AMReX_SPACE

C$(AMREX_BASE)_sources += AMReX_DistributionMapping.cpp AMReX_ParallelDescriptor.cpp
C$(AMREX_BASE)_headers += AMReX_DistributionMapping.H AMReX_ParallelDescriptor.H
-C$(AMREX_BASE)_headers += AMReX_OpenMP.H
+C$(AMREX_BASE)_headers += AMReX_OpenMP.H AMReX_OpenMP.cpp

C$(AMREX_BASE)_headers += AMReX_ParallelReduce.H
