amrex.omp_threads: Can Avoid SMT #3607

Conversation
Force-pushed from 72782c5 to cfce2e7
std::set<std::vector<int>> uniqueThreadSets;
int cpuIndex = 0;

while (true) {
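For context, a minimal sketch of the sysfs-based counting this excerpt appears to come from (an assumed reconstruction, not the verbatim PR code): each logical CPU advertises the set of SMT siblings sharing its physical core, and the number of unique sibling sets is the number of physical cores.

```cpp
// Assumed reconstruction of the approach the excerpt above belongs to
// (a sketch, not the verbatim PR code): count unique SMT sibling sets
// exposed by the Linux kernel under /sys/devices/system/cpu/.
#include <fstream>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

int count_physical_cores_linux () {
    std::set<std::vector<int>> uniqueThreadSets;
    int cpuIndex = 0;
    while (true) {
        std::ifstream siblings(
            "/sys/devices/system/cpu/cpu" + std::to_string(cpuIndex)
            + "/topology/thread_siblings_list");
        if (!siblings.is_open()) { break; } // ran out of logical CPUs
        std::string line;
        std::getline(siblings, line);
        // Typically a comma-separated list such as "0,8"; some kernels
        // print ranges like "0-1", which a robust parser must expand.
        std::vector<int> threadSet;
        std::stringstream ss(line);
        std::string token;
        while (std::getline(ss, token, ',')) {
            threadSet.push_back(std::stoi(token));
        }
        uniqueThreadSets.insert(std::move(threadSet));
        ++cpuIndex;
    }
    return static_cast<int>(uniqueThreadSets.size());
}
```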
Would it be simpler to grep (using `<regex>`) `/proc/cpuinfo`?
$ cat /proc/cpuinfo | grep "processor"
processor : 0
processor : 1
processor : 2
processor : 3
processor : 4
processor : 5
processor : 6
processor : 7
$ cat /proc/cpuinfo | grep "core"
core id : 0
cpu cores : 4
core id : 1
cpu cores : 4
core id : 2
cpu cores : 4
core id : 3
cpu cores : 4
core id : 0
cpu cores : 4
core id : 1
cpu cores : 4
core id : 2
cpu cores : 4
core id : 3
cpu cores : 4
I have 4 physical cores and 8 threads.
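For illustration, a minimal sketch of that suggestion (hypothetical code, not part of the PR), assuming the `/proc/cpuinfo` format shown above:

```cpp
// Hypothetical sketch of the grep-style suggestion (not the PR's code):
// scan /proc/cpuinfo with <regex>, counting logical processors and unique
// "core id" values. Caveat: on multi-socket systems core ids repeat per
// package, so "physical id" would have to be tracked as well.
#include <fstream>
#include <iostream>
#include <regex>
#include <set>
#include <string>

int main () {
    std::ifstream cpuinfo("/proc/cpuinfo");
    std::regex const proc_re(R"(^processor\s*:\s*(\d+))");
    std::regex const core_re(R"(^core id\s*:\s*(\d+))");
    int num_logical = 0;
    std::set<int> core_ids;
    std::string line;
    std::smatch m;
    while (std::getline(cpuinfo, line)) {
        if (std::regex_search(line, m, proc_re)) {
            ++num_logical;
        } else if (std::regex_search(line, m, core_re)) {
            core_ids.insert(std::stoi(m[1].str()));
        }
    }
    std::cout << num_logical << " logical CPUs, "
              << core_ids.size() << " unique core ids\n";
}
```

On the machine above this would print 8 logical CPUs and 4 unique core ids.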
`cpu cores` could be the right one. Here is an example from a hybrid CPU:
model name : 12th Gen Intel(R) Core(TM) i9-12900H
$ cat /proc/cpuinfo | grep "processor"
processor : 0
processor : 1
...
processor : 19
axel@axel-dell:~$ cat /proc/cpuinfo | grep "core"
core id : 0
cpu cores : 14
core id : 0
cpu cores : 14
core id : 4
cpu cores : 14
core id : 4
cpu cores : 14
core id : 8
cpu cores : 14
core id : 8
cpu cores : 14
core id : 12
cpu cores : 14
core id : 12
cpu cores : 14
core id : 16
cpu cores : 14
core id : 16
cpu cores : 14
core id : 20
cpu cores : 14
core id : 20
cpu cores : 14
core id : 24
cpu cores : 14
core id : 25
cpu cores : 14
core id : 26
cpu cores : 14
core id : 27
cpu cores : 14
core id : 28
cpu cores : 14
core id : 29
cpu cores : 14
core id : 30
cpu cores : 14
core id : 31
cpu cores : 14
Effectively 14 physical cores: the first 12 logical CPUs run with 2x SMT (6 physical cores), and the remaining 8 logical CPUs are 1x SMT (8 physical cores).
Hm, does not work on RHEL (Lassen, POWER9, altivec supported, 4x SMT):
$ cat /proc/cpuinfo | grep "processor"
processor : 0
processor : 1
...
processor : 127
[huebl1@lassen708:~]$ cat /proc/cpuinfo | grep "core"
<empty>
Same issue for the Power9 on Summit.
The current implementation works with the Linux 4 and 5 kernels I tested.
I am not sure what defines the `/proc/cpuinfo` output (OS/distribution or kernel), but it looks very diverse across the systems I tested, and at least on the Linux 4 kernel-based systems it does not contain the info we need.
Didn't realize `/proc/cpuinfo` is so different on different machines.
I had no idea either, today I learned...
Force-pushed from cfce2e7 to 809362f
Force-pushed from 9cdf73c to ef16a75
@WeiqunZhang tests completed and documentation added :) I would roll this without a default change, and ship it 1-2 months in WarpX and ImpactX with the `nosmt` default.
@WeiqunZhang thank you for finalizing the PR with your commits, looks great 👍
// default or OMP_NUM_THREADS environment variable
} else if (omp_threads == "nosmt") {
    char const *env_omp_num_threads = std::getenv("OMP_NUM_THREADS");
    if (env_omp_num_threads != nullptr && amrex::system::verbose > 1) {
Oops, bug fix in #3647:
## Summary

Fix that `OMP_NUM_THREADS` was ignored in non-verbose runs.

## Additional background

Follow-up to #3607

## Checklist

The proposed changes:

- [x] fix a bug or incorrect behavior in AMReX
- [ ] add new capabilities to AMReX
- [ ] changes answers in the test suite to more than roundoff level
- [ ] are likely to significantly affect the results of downstream AMReX users
- [ ] include documentation in the code and/or rst files, if appropriate
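The gist of that fix, sketched under assumptions (this is not the actual #3647 diff): honoring `OMP_NUM_THREADS` must not be gated on verbosity; only the informational message should be.

```cpp
// Sketch of the corrected control flow (assumed shape, not the actual
// #3647 diff). num_physical_cores is a hypothetical value computed
// elsewhere, e.g. by the sysfs counting sketched earlier.
#include <AMReX_Print.H>
#include <cstdlib>
#include <omp.h>

void apply_nosmt (int num_physical_cores, int verbose) {
    char const *env_omp_num_threads = std::getenv("OMP_NUM_THREADS");
    if (env_omp_num_threads != nullptr) {
        // OMP_NUM_THREADS takes precedence over the nosmt default;
        // only the log message depends on the verbosity level.
        if (verbose > 1) {
            amrex::Print() << "amrex.omp_threads=nosmt is overridden by "
                              "OMP_NUM_THREADS\n";
        }
    } else {
        omp_set_num_threads(num_physical_cores);
    }
}
```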
## Summary

In all our applications in BLAST, the OpenMP default of using all [logical cores on modern CPUs](https://en.wikipedia.org/wiki/Simultaneous_multithreading) results in significantly slower performance than just using the physical cores with AMReX. Thus, we introduce a new option `amrex.omp_threads` that enables control over the OpenMP threads at startup and has - for most popular systems - an implementation to find out the actual number of physical cores and default to it.

For codes and users that change the default to `amrex.omp_threads = nosmt`, the `OMP_NUM_THREADS` variable will still take precedence. This is a bit unusual (because CLI options usually have higher precedence than env vars - and they do if the user provides a number here), but done intentionally: this way, codes like WarpX can set the `nosmt` default and HPC job scripts will set the exact, preferably benchmarked number of threads as usual, without surprises.

- [x] document

## Tests Performed for AMReX OMP Backend

Tests were performed with very small examples: the WarpX 3D LWFA test as checked in, or the AMReX AMRCore 3D test.

- [x] Ubuntu 22.04 laptop w/ 12th Gen Intel i9-12900H: @ax3l
  - 20 logical cores; the first 12 logical cores use 2x SMT/HT
  - 20 virtual (default) -> 14 physical (`amrex.omp_threads = nosmt`) - faster runtime!
- [x] Perlmutter (SUSE Linux Enterprise 15.4, kernel 5.14.21)
  - [CPU node](https://docs.nersc.gov/systems/perlmutter/architecture/) with 2x [AMD EPYC 7763](https://www.amd.com/en/products/cpu/amd-epyc-7763) - 2x SMT
  - 256 default, 128 with `amrex.omp_threads = nosmt` - faster runtime!
- [x] Frontier (SUSE Linux Enterprise 15.4, kernel 5.14.21)
  - 1x AMD EPYC 7763 64-core processor (w/ 2x SMT enabled) - 2x SMT
  - 128 default, 64 with `amrex.omp_threads = nosmt` - faster runtime!
  - The ideal result might be even lower, due to the first cores being used by the OS and [low-noise cores](https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#low-noise-mode-layout) after that. But that is an orthogonal question and should be set in job scripts: `#SBATCH --ntasks-per-node=8` `#SBATCH --cpus-per-task=7` `#SBATCH --gpus-per-task=1`
- [x] Summit (RHEL 8.2, kernel 4.18.0)
  - 2x IBM Power9 (22 physical cores each, 6 per socket disabled/hidden for the OS?, 4x SMT enabled; cpuinfo says 128 total) - 4x SMT
  - 128 default, 32 with `amrex.omp_threads = nosmt` - faster runtime!
- [x] [Lassen](https://hpc.llnl.gov/hardware/compute-platforms/lassen) (RHEL 7.9, kernel 4.14.0)
  - 2x IBM Power9 (22 physical cores each, 2 per socket reserved for the OS?, 4x SMT enabled) - 4x SMT
  - 160 default, 44 with `amrex.omp_threads = nosmt` - faster runtime!
  - The ideal result might be even down to 40, but that is an orthogonal question and should be set in job scripts.
- [x] macOS M1 (arm64/aarch64) mini:
  - no SMT/HT
  - 8 default, 8 with `amrex.omp_threads = nosmt`
- [x] macOS (Ventura 13.5.2, 2.8 GHz Quad-Core Intel Core i7-8569U) Intel x86_64: @n01r
  - 2x SMT
  - 8 default, 4 with `amrex.omp_threads = nosmt` - faster runtime!
- [x] macOS (Ventura 13.5.2) M1 Max on Mac Studio: @RTSandberg
  - no SMT/HT
  - 10 default, 10 with `amrex.omp_threads = nosmt`
- [ ] some BSD/FreeBSD system? - no user requests - low priority, we just keep the default for now
- [ ] Windows... looking for a system
  

## Additional background

## Checklist

The proposed changes:

- [ ] fix a bug or incorrect behavior in AMReX
- [x] add new capabilities to AMReX
- [ ] changes answers in the test suite to more than roundoff level
- [ ] are likely to significantly affect the results of downstream AMReX users
- [ ] include documentation in the code and/or rst files, if appropriate

---------

Co-authored-by: Weiqun Zhang <[email protected]>