[WIP] CI: debug clang thread sanitizer errors #5492

Open: wants to merge 9 commits into development


EZoni
Member

@EZoni EZoni commented Dec 3, 2024

Debug data race conditions raised by the Clang thread sanitizer CI job that was disabled in #5474.

@EZoni EZoni added the component: tests Tests and CI label Dec 3, 2024
@EZoni
Member Author

EZoni commented Dec 4, 2024

Just copying one piece of information from the Clang documentation:

ThreadSanitizer is in beta stage. It is known to work on large C++ programs using pthreads, but we do not promise anything (yet). C++11 threading is supported with llvm libc++. The test suite is integrated into CMake build and can be run with make check-tsan command.

@EZoni
Member Author

EZoni commented Dec 5, 2024

This is the summary of the race condition raised by the sanitizer, seemingly referring to a race condition on the AMReX end (line 105 of Src/Base/AMReX_Random.cpp):

SUMMARY: ThreadSanitizer: data race /home/runner/work/WarpX/WarpX/build/_deps/fetchedamrex-src/Src/Base/AMReX_Random.cpp:105:27 in amrex::InitRandom(unsigned long, int, unsigned long) (.omp_outlined_debug__)

Full log preceding that summary message:

WARNING: ThreadSanitizer: data race (pid=9014)
  Read of size 8 at 0x7ffe53cae3e8 by thread T3:
    #0 amrex::InitRandom(unsigned long, int, unsigned long) (.omp_outlined_debug__) /home/runner/work/WarpX/WarpX/build/_deps/fetchedamrex-src/Src/Base/AMReX_Random.cpp:105:27 (warpx.1d.MPI.OMP.DP.PDP.OPMD.FFT.QED.GENQEDTABLES+0x82877a) (BuildId: 559e42540025a3c6ccba75684192ae77b83a3c08)
    #1 amrex::InitRandom(unsigned long, int, unsigned long) (.omp_outlined) /home/runner/work/WarpX/WarpX/build/_deps/fetchedamrex-src/Src/Base/AMReX_Random.cpp:101:1 (warpx.1d.MPI.OMP.DP.PDP.OPMD.FFT.QED.GENQEDTABLES+0x82877a)
    #2 __kmp_invoke_microtask <null> (libomp.so.5+0xdddc2) (BuildId: 5bc060b5a52d935eaa6f063c7bb0d9bd35e11f99)

  Previous write of size 8 at 0x7ffe53cae3e8 by main thread:
    #0 amrex::InitRandom(unsigned long, int, unsigned long) /home/runner/work/WarpX/WarpX/build/_deps/fetchedamrex-src/Src/Base/AMReX_Random.cpp (warpx.1d.MPI.OMP.DP.PDP.OPMD.FFT.QED.GENQEDTABLES+0x828641) (BuildId: 559e42540025a3c6ccba75684192ae77b83a3c08)
    #1 amrex::Initialize(int&, char**&, bool, ompi_communicator_t*, std::function<void ()> const&, std::ostream&, std::ostream&, void (*)(char const*)) /home/runner/work/WarpX/WarpX/build/_deps/fetchedamrex-src/Src/Base/AMReX.cpp:647:5 (warpx.1d.MPI.OMP.DP.PDP.OPMD.FFT.QED.GENQEDTABLES+0x7eebe9) (BuildId: 559e42540025a3c6ccba75684192ae77b83a3c08)
    #2 warpx::initialization::amrex_init(int&, char**&, bool) /home/runner/work/WarpX/WarpX/Source/Initialization/WarpXAMReXInit.cpp:116:16 (warpx.1d.MPI.OMP.DP.PDP.OPMD.FFT.QED.GENQEDTABLES+0x67096e) (BuildId: 559e42540025a3c6ccba75684192ae77b83a3c08)
    #3 warpx::initialization::initialize_external_libraries(int, char**) /home/runner/work/WarpX/WarpX/Source/Initialization/WarpXInit.cpp:20:5 (warpx.1d.MPI.OMP.DP.PDP.OPMD.FFT.QED.GENQEDTABLES+0x2d696f) (BuildId: 559e42540025a3c6ccba75684192ae77b83a3c08)
    #4 main /home/runner/work/WarpX/WarpX/Source/main.cpp:20:5 (warpx.1d.MPI.OMP.DP.PDP.OPMD.FFT.QED.GENQEDTABLES+0x1396da) (BuildId: 559e42540025a3c6ccba75684192ae77b83a3c08)
    #0 amrex::InitRandom(unsigned long, int, unsigned long) (.omp_outlined_debug__) /home/runner/work/WarpX/WarpX/build/_deps/fetchedamrex-src/Src/Base/AMReX_Random.cpp:105:27 (warpx.1d.MPI.OMP.DP.PDP.OPMD.FFT.QED.GENQEDTABLES+0x82877a) (BuildId: 559e42540025a3c6ccba75684192ae77b83a3c08)
    #1 amrex::InitRandom(unsigned long, int, unsigned long) (.omp_outlined) /home/runner/work/WarpX/WarpX/build/_deps/fetchedamrex-src/Src/Base/AMReX_Random.cpp:101:1 (warpx.1d.MPI.OMP.DP.PDP.OPMD.FFT.QED.GENQEDTABLES+0x82877a)
    #2 __kmp_invoke_microtask <null> (libomp.so.5+0xdddc2) (BuildId: 5bc060b5a52d935eaa6f063c7bb0d9bd35e11f99)

  Previous write of size 8 at 0x7ffdff075bd8 by main thread:
    #0 amrex::InitRandom(unsigned long, int, unsigned long) /home/runner/work/WarpX/WarpX/build/_deps/fetchedamrex-src/Src/Base/AMReX_Random.cpp (warpx.1d.MPI.OMP.DP.PDP.OPMD.FFT.QED.GENQEDTABLES+0x828641) (BuildId: 559e42540025a3c6ccba75684192ae77b83a3c08)
    #1 amrex::Initialize(int&, char**&, bool, ompi_communicator_t*, std::function<void ()> const&, std::ostream&, std::ostream&, void (*)(char const*)) /home/runner/work/WarpX/WarpX/build/_deps/fetchedamrex-src/Src/Base/AMReX.cpp:647:5 (warpx.1d.MPI.OMP.DP.PDP.OPMD.FFT.QED.GENQEDTABLES+0x7eebe9) (BuildId: 559e42540025a3c6ccba75684192ae77b83a3c08)
    #2 warpx::initialization::amrex_init(int&, char**&, bool) /home/runner/work/WarpX/WarpX/Source/Initialization/WarpXAMReXInit.cpp:116:16 (warpx.1d.MPI.OMP.DP.PDP.OPMD.FFT.QED.GENQEDTABLES+0x67096e) (BuildId: 559e42540025a3c6ccba75684192ae77b83a3c08)
    #3 warpx::initialization::initialize_external_libraries(int, char**) /home/runner/work/WarpX/WarpX/Source/Initialization/WarpXInit.cpp:20:5 (warpx.1d.MPI.OMP.DP.PDP.OPMD.FFT.QED.GENQEDTABLES+0x2d696f) (BuildId: 559e42540025a3c6ccba75684192ae77b83a3c08)
    #4 main /home/runner/work/WarpX/WarpX/Source/main.cpp:20:5 (warpx.1d.MPI.OMP.DP.PDP.OPMD.FFT.QED.GENQEDTABLES+0x1396da) (BuildId: 559e42540025a3c6ccba75684192ae77b83a3c08)


  Location is stack of main thread.

  Location is global '??' at 0x7ffe53c90000 ([stack]+0x1e3e8)

  Thread T3 (tid=9024, running) created by main thread at:
  Location is stack of main thread.

  Location is global '??' at 0x7ffdff057000 ([stack]+0x1ebd8)

  Thread T3 (tid=9023, running) created by main thread at:
    #0 pthread_create <null> (warpx.1d.MPI.OMP.DP.PDP.OPMD.FFT.QED.GENQEDTABLES+0xaf52f) (BuildId: 559e42540025a3c6ccba75684192ae77b83a3c08)
    #0 pthread_create <null> (warpx.1d.MPI.OMP.DP.PDP.OPMD.FFT.QED.GENQEDTABLES+0xaf52f) (BuildId: 559e42540025a3c6ccba75684192ae77b83a3c08)
    #1 <null> <null> (libomp.so.5+0xb587a) (BuildId: 5bc060b5a52d935eaa6f063c7bb0d9bd35e11f99)
    #1 <null> <null> (libomp.so.5+0xb587a) (BuildId: 5bc060b5a52d935eaa6f063c7bb0d9bd35e11f99)
    #2 amrex::Initialize(int&, char**&, bool, ompi_communicator_t*, std::function<void ()> const&, std::ostream&, std::ostream&, void (*)(char const*)) /home/runner/work/WarpX/WarpX/build/_deps/fetchedamrex-src/Src/Base/AMReX.cpp:642:5 (warpx.1d.MPI.OMP.DP.PDP.OPMD.FFT.QED.GENQEDTABLES+0x7eeba2) (BuildId: 559e42540025a3c6ccba75684192ae77b83a3c08)
    #2 amrex::Initialize(int&, char**&, bool, ompi_communicator_t*, std::function<void ()> const&, std::ostream&, std::ostream&, void (*)(char const*)) /home/runner/work/WarpX/WarpX/build/_deps/fetchedamrex-src/Src/Base/AMReX.cpp:642:5 (warpx.1d.MPI.OMP.DP.PDP.OPMD.FFT.QED.GENQEDTABLES+0x7eeba2) (BuildId: 559e42540025a3c6ccba75684192ae77b83a3c08)
    #3 warpx::initialization::amrex_init(int&, char**&, bool) /home/runner/work/WarpX/WarpX/Source/Initialization/WarpXAMReXInit.cpp:116:16 (warpx.1d.MPI.OMP.DP.PDP.OPMD.FFT.QED.GENQEDTABLES+0x67096e) (BuildId: 559e42540025a3c6ccba75684192ae77b83a3c08)
    #3 warpx::initialization::amrex_init(int&, char**&, bool) /home/runner/work/WarpX/WarpX/Source/Initialization/WarpXAMReXInit.cpp:116:16 (warpx.1d.MPI.OMP.DP.PDP.OPMD.FFT.QED.GENQEDTABLES+0x67096e) (BuildId: 559e42540025a3c6ccba75684192ae77b83a3c08)
    #4 warpx::initialization::initialize_external_libraries(int, char**) /home/runner/work/WarpX/WarpX/Source/Initialization/WarpXInit.cpp:20:5 (warpx.1d.MPI.OMP.DP.PDP.OPMD.FFT.QED.GENQEDTABLES+0x2d696f) (BuildId: 559e42540025a3c6ccba75684192ae77b83a3c08)
    #4 warpx::initialization::initialize_external_libraries(int, char**) /home/runner/work/WarpX/WarpX/Source/Initialization/WarpXInit.cpp:20:5 (warpx.1d.MPI.OMP.DP.PDP.OPMD.FFT.QED.GENQEDTABLES+0x2d696f) (BuildId: 559e42540025a3c6ccba75684192ae77b83a3c08)
    #5 main /home/runner/work/WarpX/WarpX/Source/main.cpp:20:5 (warpx.1d.MPI.OMP.DP.PDP.OPMD.FFT.QED.GENQEDTABLES+0x1396da) (BuildId: 559e42540025a3c6ccba75684192ae77b83a3c08)

    #5 main /home/runner/work/WarpX/WarpX/Source/main.cpp:20:5 (warpx.1d.MPI.OMP.DP.PDP.OPMD.FFT.QED.GENQEDTABLES+0x1396da) (BuildId: 559e42540025a3c6ccba75684192ae77b83a3c08)

@EZoni EZoni force-pushed the ci_clang_thread_sanitizer branch from c46c5fd to 93c6d8e December 6, 2024 21:59
@EZoni
Member Author

EZoni commented Dec 6, 2024

@atmyers @WeiqunZhang

I tried running with both clang-18 and clang-19 (clang-19, installed directly from LLVM, is the version used since the latest commit a49c934), but I keep seeing the data race in both cases.

@ax3l
Member

ax3l commented Jan 2, 2025

@EZoni this looks like a potential AMReX bug to me. Can you please compile with debug symbols enabled and post, in AMReX-Codes/amrex#4279, the line inside amrex::InitRandom that is causing this?

Update: ah, I overlooked that you already had line numbers in your quoted logs; sorry about that.

@ax3l ax3l added bug Something isn't working component: third party Changes in WarpX that reflect a change in a third-party library bug: affects latest release Bug also exists in latest release version labels Jan 2, 2025
@ax3l
Member

ax3l commented Jan 2, 2025

With debug symbols: job-logs.zip

2025-01-02T17:34:19.8402528Z WARNING: ThreadSanitizer: data race (pid=12273)
2025-01-02T17:34:19.8403069Z   Read of size 8 at 0x7ffcd86d71a8 by thread T3:
2025-01-02T17:34:19.8405169Z     #0 amrex::InitRandom(unsigned long, int, unsigned long) (.omp_outlined_debug__) /home/runner/work/WarpX/WarpX/build/_deps/fetchedamrex-src/Src/Base/AMReX_Random.cpp:105:27 (warpx.1d.MPI.OMP.DP.PDP.OPMD.FFT.QED.GENQEDTABLES.DEBUG+0x1417fcf) (BuildId: ff133839691af5ef2a0da05b6600258a1e2c1e48)
2025-01-02T17:34:19.8408669Z     #1 amrex::InitRandom(unsigned long, int, unsigned long) (.omp_outlined) /home/runner/work/WarpX/WarpX/build/_deps/fetchedamrex-src/Src/Base/AMReX_Random.cpp:101:1 (warpx.1d.MPI.OMP.DP.PDP.OPMD.FFT.QED.GENQEDTABLES.DEBUG+0x14180b5) (BuildId: ff133839691af5ef2a0da05b6600258a1e2c1e48)

which is this line:
https://github.com/AMReX-Codes/amrex/blob/25.01/Src/Base/AMReX_Random.cpp#L105

@ax3l
Member

ax3l commented Jan 2, 2025

Proposed fix in AMReX-Codes/amrex#4281

@EZoni

This comment was marked as outdated.

@EZoni EZoni force-pushed the ci_clang_thread_sanitizer branch from e58d193 to 07c2c75 January 14, 2025 18:26
@EZoni EZoni force-pushed the ci_clang_thread_sanitizer branch from 07c2c75 to 17ad556 January 14, 2025 19:04
@EZoni
Member Author

EZoni commented Jan 17, 2025

I extracted the changes related to the LLVM installation into #5575. Ideally, #5575 should be merged first; we can then rebase this PR and continue investigating the issue here.

@EZoni EZoni changed the title [WIP] CI: debug Clang thread sanitizer errors [WIP] CI: debug clang thread sanitizer errors Jan 18, 2025
ax3l pushed a commit that referenced this pull request Jan 18, 2025
I think this was a suggestion by @WeiqunZhang in the context of
debugging the clang sanitizer issue currently addressed in #5492. I'm
extracting all related changes from #5492 to implement and test the LLVM
installation separately here.

This effectively unifies the CI scripts that install clang dependencies
into a single script that reads the clang version number from the
command line.

I think all CI checks should pass here as a prerequisite for debugging
the clang sanitizer issue further in #5492.
Labels
bug: affects latest release Bug also exists in latest release version bug Something isn't working component: tests Tests and CI component: third party Changes in WarpX that reflect a change in a third-party library