PyTorch and OpenMP/MPI-enabled AMReX don't get along in `load_state_dict` #322

RTSandberg · 2024-05-21T19:58:46Z

On my local machine, PyTorch's internal multithreading doesn't get along with AMReX. Unless I call `torch.set_num_threads` with 1 or 2 threads, the attached script hangs when the neural network tries to set its initial parameters.

This script downloads some neural network parameters from a Zenodo archive and then loads them; the `load_state_dict` call is the specific point of failure.

pytorch_amrex_hang_reproducer_v2.py.txt
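A minimal sketch of the workaround described above, not the code from the attached reproducer. It assumes only that the thread-count calls `get_num_threads()` and `set_num_threads(n)` are available on the object passed in, which is the interface the `torch` module exposes. The helper name `limited_threads` is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def limited_threads(backend, num_threads=1):
    """Temporarily cap the thread count of `backend` (hypothetical helper).

    `backend` is any object with get_num_threads() and set_num_threads(n);
    the `torch` module provides both.
    """
    previous = backend.get_num_threads()
    backend.set_num_threads(num_threads)
    try:
        yield
    finally:
        # Restore the earlier setting once the guarded block is done.
        backend.set_num_threads(previous)


# Intended use (assumes `model` and `state_dict` already exist):
#   import torch
#   with limited_threads(torch, 1):
#       model.load_state_dict(state_dict)
```

Restoring the previous thread count afterwards is a design choice of this sketch. If hangs also occur outside of `load_state_dict`, calling `torch.set_num_threads(1)` once at startup and leaving it in place may be the safer variant.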

ax3l · 2024-05-22T19:03:48Z

Thank you, @RTSandberg !

For reproducibility, could you please add the OS you used, the versions of Python, pyAMReX, PyTorch, and mpi4py, and the MPI flavor and version?
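Most of the requested details can be gathered with the standard library. The following is a rough sketch: the function name `collect_versions` is hypothetical, and the distribution names `torch`, `mpi4py`, and `pyamrex` are assumptions that may differ depending on how each package was installed.

```python
import platform
import sys
from importlib import metadata


def collect_versions(packages=("torch", "mpi4py", "pyamrex")):
    """Return a dict with OS, Python, and installed package versions."""
    report = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is absent or registered under another name.
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

The MPI flavor and version are not covered by this sketch. With mpi4py installed, `mpi4py.MPI.Get_library_version()` should report the underlying MPI library.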

ax3l · 2024-05-22T19:09:02Z

If we can reduce this problem to a pure mpi4py + PyTorch issue, then we could also report this upstream in PyTorch: https://github.com/pytorch/pytorch/issues

ax3l · 2025-01-13T19:23:21Z

The issue seems to be more general and sometimes also shows up later, after the model has loaded; see ECP-WarpX/impactx#773 (comment)

This could also be caused by mixing OpenMP libraries (GNU gomp and LLVM omp), especially when PyTorch is installed from one source (say, pip) and the rest of the stack is built from another (system, conda, etc.).

One should check whether the issue still appears when ImpactX/AMReX and PyTorch use the same dependencies and compilers, e.g., when using exclusively Spack or Conda-Forge for everything.
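One way to test the mixed-OpenMP hypothesis above is to look at which OpenMP runtimes are mapped into the running process. The sketch below is Linux-only and parses `/proc/self/maps`. The function names are hypothetical, and the library name patterns (`libgomp`, `libomp`, `libiomp5`) are assumptions about common runtime file names that may not cover every build.

```python
import re
from pathlib import Path

# Assumed file-name prefixes: GNU (libgomp), LLVM (libomp), Intel (libiomp5).
_OPENMP_PATTERN = re.compile(r"(libgomp|libiomp5|libomp)[^/\s]*\.so[^/\s]*")


def find_openmp_runtimes(maps_text):
    """Map runtime name -> set of library paths found in /proc/<pid>/maps text."""
    found = {}
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, pathname (optional).
        fields = line.split(maxsplit=5)
        if len(fields) < 6:
            continue
        path = fields[5]
        match = _OPENMP_PATTERN.match(path.rsplit("/", 1)[-1])
        if match:
            found.setdefault(match.group(1), set()).add(path)
    return found


def loaded_openmp_runtimes():
    """Inspect the current process; returns {} where /proc is unavailable."""
    maps = Path("/proc/self/maps")
    if not maps.exists():
        return {}
    return find_openmp_runtimes(maps.read_text())
```

Libraries only appear once they are loaded, so `loaded_openmp_runtimes()` would need to be called after importing both PyTorch and pyAMReX. More than one key in the result would suggest that several OpenMP runtimes share the process.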

RTSandberg mentioned this issue May 21, 2024

set num threads to avoid hanging ECP-WarpX/impactx#619

Merged

ax3l added the labels component: third party, component: MPI, bug, and bug: affects latest release May 22, 2024

ax3l changed the title ~~PyTorch and mpi-enabled AMReX don't get along~~ PyTorch and MPI-enabled AMReX don't get along May 22, 2024

ax3l changed the title ~~PyTorch and MPI-enabled AMReX don't get along~~ PyTorch and MPI-enabled AMReX don't get along in load_state_dict May 22, 2024

ax3l mentioned this issue Jan 13, 2025

Error when executing "run_ml_surrogate_15_stage.py" ECP-WarpX/impactx#773

Open

ax3l added the label backend: openmp Jan 13, 2025

ax3l changed the title ~~PyTorch and MPI-enabled AMReX don't get along in load_state_dict~~ PyTorch and OpenMP/MPI-enabled AMReX don't get along in load_state_dict Jan 13, 2025
