
amrex.omp_threads: Can Avoid SMT #3607

Merged
7 commits merged into AMReX-Codes:development from openmp-avoid-smt on Nov 2, 2023

Conversation

@ax3l (Member) commented Oct 24, 2023

Summary

In all our applications in BLAST, the OpenMP default of using all [logical cores on modern CPUs](https://en.wikipedia.org/wiki/Simultaneous_multithreading) results in significantly slower performance than using only the physical cores with AMReX. Thus, we introduce a new option `amrex.omp_threads` that enables control over the number of OpenMP threads at startup and, for most popular systems, includes an implementation that detects the actual number of physical cores and defaults to it.

For codes and users that change the default to `amrex.omp_threads = nosmt`, the `OMP_NUM_THREADS` environment variable will still take precedence. This is a bit unusual (CLI options usually have higher precedence than environment variables, and they do if the user provides an explicit number here), but it is intentional: this way, codes like WarpX can set the `nosmt` default, and HPC job scripts can still set the exact, preferably benchmarked, number of threads as usual without surprises.
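
To make the precedence concrete, here is a hypothetical command-line sketch; the executable and inputs-file names are placeholders, and AMReX treats trailing `key=value` arguments as ParmParse overrides.

```sh
# nosmt default (from the code or the inputs file): use only physical cores
./main3d.ex inputs amrex.omp_threads=nosmt

# OMP_NUM_THREADS from a job script still wins over the nosmt default
OMP_NUM_THREADS=16 ./main3d.ex inputs amrex.omp_threads=nosmt   # -> 16 threads

# an explicit number takes precedence over the environment variable
OMP_NUM_THREADS=16 ./main3d.ex inputs amrex.omp_threads=8       # -> 8 threads
```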

  • [x] document

Tests Performed for AMReX OMP Backend

Tests were performed with very small examples: the WarpX 3D LWFA test as checked in, or the AMReX AMRCore 3D test.

  • Ubuntu 22.04 Laptop w/ 12th Gen Intel i9-12900H: @ax3l
    • 20 logical cores; the first 12 logical cores use 2x SMT/HT
    • 20 virtual (default) -> 14 physical (amrex.omp_threads = nosmt)
      • faster runtime!
  • Perlmutter (SUSE Linux Enterprise 15.4, kernel 5.14.21)
    • [CPU node](https://docs.nersc.gov/systems/perlmutter/architecture/) with 2x [AMD EPYC 7763](https://www.amd.com/en/products/cpu/amd-epyc-7763)
    • 2x SMT - 256 default, 128 with amrex.omp_threads = nosmt
      • faster runtime!
  • Frontier (SUSE Linux Enterprise 15.4, kernel 5.14.21)
    • 1x AMD EPYC 7763 64-Core Processor (w/ 2x SMT enabled)
    • 2x SMT - 128 default, 64 with amrex.omp_threads = nosmt
      • faster runtime!
    • The ideal result might be even lower, because the OS uses the first cores and [low-noise cores](https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#low-noise-mode-layout) come after that. But that is an orthogonal question and should be set in job scripts: `#SBATCH --ntasks-per-node=8` `#SBATCH --cpus-per-task=7` `#SBATCH --gpus-per-task=1`
  • Summit (RHEL 8.2, kernel 4.18.0)
    • 2x IBM Power9 (22 physical cores each, 6 per socket disabled/hidden for the OS?, 4x SMT enabled; cpuinfo says 128 total)
    • 4x SMT - 128 default, 32 with amrex.omp_threads = nosmt
      • faster runtime!
  • [Lassen](https://hpc.llnl.gov/hardware/compute-platforms/lassen) (RHEL 7.9, kernel 4.14.0)
    • 2x IBM Power9 (22 physical cores each, 2 per socket reserved for the OS?, 4x SMT enabled)
    • 4x SMT - 160 default, 44 with amrex.omp_threads = nosmt
      • faster runtime!
    • The ideal result might be even down to 40, but that is an orthogonal question and should be set in job scripts.
  • macOS M1 (arm64/aarch64) mini:
    • no SMT/HT - 8 default, 8 with amrex.omp_threads = nosmt
  • macOS (OSX Ventura 13.5.2, 2.8 GHz Quad-Core Intel Core i7-8569U) Intel x86_64 @n01r
    • 2x SMT - 8 default, 4 with amrex.omp_threads = nosmt
    • faster runtime!
  • macOS (OSX Ventura 13.5.2) M1 Max on mac studio @RTSandberg
    • no SMT/HT - 10 default, 10 with amrex.omp_threads = nosmt
  • some BSD/FreeBSD system?
    • no user requests
    • low priority, we just keep the default for now
  • Windows... looking for a system

Additional background

Checklist

The proposed changes:

  • [ ] fix a bug or incorrect behavior in AMReX
  • [x] add new capabilities to AMReX
  • [ ] changes answers in the test suite to more than roundoff level
  • [ ] are likely to significantly affect the results of downstream AMReX users
  • [ ] include documentation in the code and/or rst files, if appropriate

@ax3l force-pushed the openmp-avoid-smt branch 4 times, most recently from 72782c5 to cfce2e7 on October 24, 2023 07:32
std::set<std::vector<int>> uniqueThreadSets;
int cpuIndex = 0;

while (true) {
Member commented:

Would it be simpler to grep (using `<regex>`) /proc/cpuinfo?

$ cat /proc/cpuinfo | grep "processor"
processor       : 0
processor       : 1
processor       : 2
processor       : 3
processor       : 4
processor       : 5
processor       : 6
processor       : 7

$ cat /proc/cpuinfo | grep "core"
core id         : 0
cpu cores       : 4
core id         : 1
cpu cores       : 4
core id         : 2
cpu cores       : 4
core id         : 3
cpu cores       : 4
core id         : 0
cpu cores       : 4
core id         : 1
cpu cores       : 4
core id         : 2
cpu cores       : 4
core id         : 3
cpu cores       : 4

I have 4 physical cores and 8 threads.
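
For reference, a minimal standalone sketch of the `<regex>`-over-/proc/cpuinfo idea suggested above (hypothetical, not the code merged in this PR). As the rest of the thread shows, it would break on systems whose cpuinfo has no `core id` lines.

```cpp
// Hypothetical sketch only; not the implementation merged in this PR.
// Counts unique (physical id, core id) pairs in /proc/cpuinfo.
#include <fstream>
#include <iostream>
#include <regex>
#include <set>
#include <string>
#include <utility>

int main ()
{
    std::ifstream cpuinfo("/proc/cpuinfo");
    std::string line;
    std::regex const phys_re(R"(^physical id\s*:\s*(\d+))");
    std::regex const core_re(R"(^core id\s*:\s*(\d+))");
    std::smatch m;
    int physical_id = 0; // stays 0 on machines whose cpuinfo lacks "physical id"
    std::set<std::pair<int,int>> physical_cores; // unique (package, core) pairs
    while (std::getline(cpuinfo, line)) {
        if (std::regex_search(line, m, phys_re)) {
            physical_id = std::stoi(m[1]);
        } else if (std::regex_search(line, m, core_re)) {
            physical_cores.insert({physical_id, std::stoi(m[1])});
        }
    }
    std::cout << "physical cores: " << physical_cores.size() << "\n";
}
```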

ax3l (Member, Author) commented:

`cpu cores` could be the right one.

Here is an example from my laptop:

model name	: 12th Gen Intel(R) Core(TM) i9-12900H
$ cat /proc/cpuinfo | grep "processor"
processor	: 0
processor	: 1
processor	: 2
processor	: 3
processor	: 4
processor	: 5
processor	: 6
processor	: 7
processor	: 8
processor	: 9
processor	: 10
processor	: 11
processor	: 12
processor	: 13
processor	: 14
processor	: 15
processor	: 16
processor	: 17
processor	: 18
processor	: 19
axel@axel-dell:~$ cat /proc/cpuinfo | grep "core"
core id		: 0
cpu cores	: 14
core id		: 0
cpu cores	: 14
core id		: 4
cpu cores	: 14
core id		: 4
cpu cores	: 14
core id		: 8
cpu cores	: 14
core id		: 8
cpu cores	: 14
core id		: 12
cpu cores	: 14
core id		: 12
cpu cores	: 14
core id		: 16
cpu cores	: 14
core id		: 16
cpu cores	: 14
core id		: 20
cpu cores	: 14
core id		: 20
cpu cores	: 14
core id		: 24
cpu cores	: 14
core id		: 25
cpu cores	: 14
core id		: 26
cpu cores	: 14
core id		: 27
cpu cores	: 14
core id		: 28
cpu cores	: 14
core id		: 29
cpu cores	: 14
core id		: 30
cpu cores	: 14
core id		: 31
cpu cores	: 14

Effectively 14 physical cores: the first 12 logical cores use 2x SMT (6 physical cores), and the next 8 are 1x SMT (8 physical cores), i.e. 6 + 8 = 14 physical and 12 + 8 = 20 logical.

ax3l (Member, Author) commented:

Hm, does not work on RHEL (Lassen, POWER9, altivec supported, 4x SMT):

$ cat /proc/cpuinfo | grep "processor"
processor	: 0
processor	: 1
[... processor 2 through 125 elided ...]
processor	: 126
processor	: 127
[huebl1@lassen708:~]$ cat /proc/cpuinfo | grep "core"
<empty>

ax3l (Member, Author) commented:

Same issue for the Power9 on Summit.

ax3l (Member, Author) commented on Oct 25, 2023:

The current implementation works with the Linux 4 and 5 kernels I tested.

I am not sure what determines the /proc/cpuinfo output (the OS/distribution or the kernel), but it looks very diverse across the systems I tested and does not contain the info we need on at least the Linux 4 kernel-based systems.
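
For context, here is a hedged sketch of a sysfs-based detection consistent with the `uniqueThreadSets`/`cpuIndex` loop excerpted above; the exact file names and parsing in the merged AMReX code may differ.

```cpp
// Hedged sketch of sysfs-based physical-core counting, consistent with the
// uniqueThreadSets/cpuIndex/while(true) excerpt above; the merged AMReX code
// (AMReX_OpenMP.cpp) may differ in paths and details.
#include <fstream>
#include <iostream>
#include <set>
#include <sstream>
#include <string>
#include <vector>

int main ()
{
    std::set<std::vector<int>> uniqueThreadSets; // one entry per physical core
    int cpuIndex = 0;
    while (true) {
        std::string const path = "/sys/devices/system/cpu/cpu"
            + std::to_string(cpuIndex) + "/topology/thread_siblings_list";
        std::ifstream f(path);
        if (!f.is_open()) { break; } // no more logical CPUs
        std::string line;
        std::getline(f, line);       // e.g. "0,64" or "0-1"
        std::vector<int> threads;
        std::istringstream iss(line);
        std::string token;
        while (std::getline(iss, token, ',')) {
            auto const dash = token.find('-');
            if (dash == std::string::npos) {
                threads.push_back(std::stoi(token));
            } else {
                int const lo = std::stoi(token.substr(0, dash));
                int const hi = std::stoi(token.substr(dash + 1));
                for (int t = lo; t <= hi; ++t) { threads.push_back(t); }
            }
        }
        uniqueThreadSets.insert(threads); // SMT siblings collapse into one set
        ++cpuIndex;
    }
    std::cout << "physical cores: " << uniqueThreadSets.size() << "\n";
}
```

Counting unique sibling sets instead of logical CPUs is what lets SMT machines report the physical-core counts listed in the tests above (e.g. 256 -> 128 on Perlmutter).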

Member commented:

Didn't realize /proc/cpuinfo is so different on different machines.

ax3l (Member, Author) commented:

I had no idea either, today I learned...

@ax3l force-pushed the openmp-avoid-smt branch from cfce2e7 to 809362f on October 24, 2023 16:00
Src/Base/AMReX_OpenMP.cpp: review thread (outdated, resolved)
@ax3l marked this pull request as ready for review on October 28, 2023 06:10
ax3l added 2 commits October 27, 2023 23:11
@ax3l (Member, Author) commented Oct 28, 2023

@WeiqunZhang tests completed and documentation added :)

I would roll this out without a default change, ship it for 1-2 months in WarpX and ImpactX with the nosmt default, and if you like we can still switch the default in AMReX afterwards to do the same?

Src/Base/Make.package: review thread (outdated, resolved)
@WeiqunZhang merged commit a7afcba into AMReX-Codes:development on Nov 2, 2023
69 checks passed
@ax3l deleted the openmp-avoid-smt branch on November 9, 2023 16:45
@ax3l (Member, Author) commented Nov 9, 2023

@WeiqunZhang thank you for finalizing the PR with your commits, looks great 👍

@ax3l mentioned this pull request on Nov 29, 2023
    // default or OMP_NUM_THREADS environment variable
} else if (omp_threads == "nosmt") {
    char const *env_omp_num_threads = std::getenv("OMP_NUM_THREADS");
    if (env_omp_num_threads != nullptr && amrex::system::verbose > 1) {
ax3l (Member, Author) commented:

Ooopsi, bug fix in #3647
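
For clarity, here is a hedged, self-contained illustration of the control flow fixed in #3647: the `OMP_NUM_THREADS` override has to be honored regardless of verbosity, and only the diagnostic message should be gated on the verbose level. The function and message names below are made up for the example and are not the AMReX source.

```cpp
// Hedged illustration of the fix described in #3647; names here are made up
// for the example and are not the AMReX source.
#include <cstdlib>
#include <iostream>

void apply_nosmt_default (int verbose, int n_physical_cores)
{
    char const* env_omp_num_threads = std::getenv("OMP_NUM_THREADS");
    if (env_omp_num_threads != nullptr) {
        // The buggy version combined this check with `verbose > 1`, so in quiet
        // runs this branch was skipped and OMP_NUM_THREADS was effectively ignored.
        if (verbose > 1) {
            std::cout << "amrex.omp_threads=nosmt overridden by OMP_NUM_THREADS="
                      << env_omp_num_threads << "\n";
        }
        return; // respect the user's explicit thread count
    }
    // No OMP_NUM_THREADS set: limit OpenMP to the physical cores,
    // e.g. omp_set_num_threads(n_physical_cores) when built with OpenMP.
    std::cout << "would set " << n_physical_cores << " OpenMP threads\n";
}

int main () { apply_nosmt_default(/*verbose=*/0, /*n_physical_cores=*/8); }
```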

WeiqunZhang pushed a commit that referenced this pull request Nov 29, 2023
## Summary

Fix that `OMP_NUM_THREADS` was ignored in non-verbose runs.

## Additional background

Follow-up to #3607

## Checklist

The proposed changes:
- [x] fix a bug or incorrect behavior in AMReX
- [ ] add new capabilities to AMReX
- [ ] changes answers in the test suite to more than roundoff level
- [ ] are likely to significantly affect the results of downstream AMReX
users
- [ ] include documentation in the code and/or rst files, if appropriate
guj pushed a commit to guj/amrex that referenced this pull request Dec 13, 2023
guj pushed a commit to guj/amrex that referenced this pull request Dec 13, 2023