PyTorch and OpenMP/MPI-enabled AMReX don't get along in load_state_dict #322

Open
RTSandberg opened this issue May 21, 2024 · 3 comments
Labels
backend: openmp Specific to OpenMP execution (CPUs) bug: affects latest release Bug also exists in latest release version bug Something isn't working component: MPI Domain decomposition and communication component: third party Changes in ImpactX that reflect a change in a third-party library

Comments

@RTSandberg
Member

RTSandberg commented May 21, 2024

On my local machine, PyTorch's internal multithreading doesn't get along with AMReX. Unless I call torch.set_num_threads(1) or torch.set_num_threads(2), the attached script hangs when the neural network tries to set its initial parameters.

The script downloads neural network parameters from a Zenodo archive and then loads them; the load_state_dict call is the specific point of failure.

pytorch_amrex_hang_reproducer_v2.py.txt
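
For readers who hit the same hang, the workaround described above can be sketched roughly as follows. This is a minimal sketch, not the attached reproducer: the helper name `limit_torch_threads` is hypothetical, and it only assumes that `torch.set_num_threads` is called before the model is created and `load_state_dict` is invoked.

```python
def limit_torch_threads(n=1):
    """Cap PyTorch's intra-op CPU thread count; return False if PyTorch is missing."""
    try:
        import torch
    except ImportError:
        return False  # PyTorch is not installed, nothing to configure
    # Values of 1 or 2 were reported to avoid the hang on the reporter's machine.
    torch.set_num_threads(n)
    return True


limit_torch_threads(1)
# ... afterwards, create the model and call model.load_state_dict(...) as before
```

This only limits PyTorch's own thread pool. It does not change how many OpenMP threads AMReX uses, and it is a workaround rather than a fix for the underlying conflict.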

@ax3l added the labels component: third party, component: MPI, bug, and bug: affects latest release on May 22, 2024
@ax3l
Member

ax3l commented May 22, 2024

Thank you, @RTSandberg !

For reproducibility, can you please add the OS you used, versions of Python, pyAMReX, PyTorch, MPI flavor and version, and mpi4py version?
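
One possible way to gather the requested information in a single run is sketched below. The distribution names passed to `importlib.metadata` (`pyamrex`, `torch`, `mpi4py`) are assumptions and may differ depending on how each package was installed; the helper name `collect_versions` is hypothetical.

```python
import platform
import sys
from importlib import metadata


def collect_versions(packages=("pyamrex", "torch", "mpi4py")):
    """Return a dict with OS, Python, package, and MPI library version strings."""
    info = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    try:
        from mpi4py import MPI  # importing this initializes MPI

        # Reports the MPI flavor and version, e.g. an Open MPI or MPICH banner.
        info["mpi"] = MPI.Get_library_version().strip()
    except ImportError:
        info["mpi"] = "mpi4py not importable"
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

Pasting the printed lines into the issue would cover the OS, Python, package, and MPI versions asked for above.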

@ax3l ax3l changed the title PyTorch and mpi-enabled AMReX don't get along PyTorch and MPI-enabled AMReX don't get along May 22, 2024
@ax3l ax3l changed the title PyTorch and MPI-enabled AMReX don't get along PyTorch and MPI-enabled AMReX don't get along in load_state_dict May 22, 2024
@ax3l
Member

ax3l commented May 22, 2024

If we can reduce this problem to a pure mpi4py + PyTorch issue, then we could also report this upstream in PyTorch: https://github.com/pytorch/pytorch/issues
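
A reduced test along these lines might look like the sketch below. It is untested against the reported hang and may well not trigger it: the small `Sequential` model is a stand-in for the network whose parameters the original script downloads from Zenodo, and the function name `try_load_state_dict` is hypothetical.

```python
def try_load_state_dict():
    """Round-trip a state dict with MPI initialized; return a status string."""
    try:
        from mpi4py import MPI  # importing this initializes MPI
        import torch
    except ImportError:
        return "skipped"  # mpi4py or PyTorch is not installed

    # Stand-in model; the original script loads downloaded parameters instead.
    model = torch.nn.Sequential(
        torch.nn.Linear(64, 64),
        torch.nn.ReLU(),
        torch.nn.Linear(64, 1),
    )
    state = {key: value.clone() for key, value in model.state_dict().items()}
    model.load_state_dict(state)  # the call reported to hang in this issue
    return f"ok on rank {MPI.COMM_WORLD.Get_rank()}"


if __name__ == "__main__":
    print(try_load_state_dict())
```

If this completes without AMReX but the full script hangs, that would suggest the problem is not a pure mpi4py + PyTorch interaction.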

@ax3l
Member

ax3l commented Jan 13, 2025

The issue seems to be more general: it sometimes also shows up later, after the model has been loaded: ECP-WarpX/impactx#773 (comment)

It could also be caused by mixing OpenMP runtime libraries (GNU libgomp and LLVM libomp), especially when PyTorch is installed from one source (say, pip) and the rest of the stack is built from another (system, conda, etc.).

One should check whether the issue still occurs when ImpactX/AMReX and PyTorch are built with the same dependencies and compilers, e.g., by using exclusively Spack or conda-forge for everything.
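
To see whether more than one OpenMP runtime ends up in the same process, one could inspect the loaded shared libraries. The sketch below is Linux-only (it reads `/proc/self/maps`) and assumes the runtimes can be recognized by file names starting with `libgomp`, `libomp`, or `libiomp`; the helper name `openmp_runtimes` is hypothetical.

```python
from pathlib import Path

# File-name prefixes of common OpenMP runtimes (GNU, LLVM, Intel); an assumption.
OPENMP_MARKERS = ("libgomp", "libomp", "libiomp")


def openmp_runtimes(maps_text):
    """Return sorted paths of OpenMP runtime libraries listed in maps_text."""
    found = set()
    for line in maps_text.splitlines():
        # Format: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping without a file path
        path = fields[5].strip()
        if Path(path).name.startswith(OPENMP_MARKERS):
            found.add(path)
    return sorted(found)


if __name__ == "__main__":
    # Import the libraries under test first (e.g. torch and amrex),
    # so that their shared objects are already mapped into the process.
    maps = Path("/proc/self/maps")
    if maps.exists():
        for lib in openmp_runtimes(maps.read_text()):
            print(lib)
```

More than one distinct runtime in the output would be consistent with the mixed-OpenMP hypothesis, although it would not prove that the mix causes the hang.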

@ax3l ax3l added the backend: openmp Specific to OpenMP execution (CPUs) label Jan 13, 2025
@ax3l ax3l changed the title PyTorch and MPI-enabled AMReX don't get along in load_state_dict PyTorch and OpenMP/MPI-enabled AMReX don't get along in load_state_dict Jan 13, 2025