CUDA error: initialization error #38

Closed
amalss18 opened this issue Jun 20, 2023 · 3 comments

@amalss18

I tried running python train_models_forward.py and ran into the following error:

Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/asebasti/PDEBench/pdebench/models/train_models_forward.py", line 250, in <module>
    main()
  File "/home/asebasti/environments/conda_envs/pde_env_2/lib/python3.10/site-packages/hydra/main.py", line 90, in decorated_main
    _run_hydra(
  File "/home/asebasti/environments/conda_envs/pde_env_2/lib/python3.10/site-packages/hydra/_internal/utils.py", line 389, in _run_hydra
    _run_app(
  File "/home/asebasti/environments/conda_envs/pde_env_2/lib/python3.10/site-packages/hydra/_internal/utils.py", line 452, in _run_app
    run_and_report(
  File "/home/asebasti/environments/conda_envs/pde_env_2/lib/python3.10/site-packages/hydra/_internal/utils.py", line 216, in run_and_report
    raise ex
  File "/home/asebasti/environments/conda_envs/pde_env_2/lib/python3.10/site-packages/hydra/_internal/utils.py", line 213, in run_and_report
    return func()
  File "/home/asebasti/environments/conda_envs/pde_env_2/lib/python3.10/site-packages/hydra/_internal/utils.py", line 453, in <lambda>
    lambda: hydra.run(
  File "/home/asebasti/environments/conda_envs/pde_env_2/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/asebasti/environments/conda_envs/pde_env_2/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/asebasti/environments/conda_envs/pde_env_2/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/asebasti/PDEBench/pdebench/models/train_models_forward.py", line 167, in main
    run_training_FNO(
  File "/home/asebasti/PDEBench/pdebench/models/fno/train.py", line 111, in run_training
    _, _data, _ = next(iter(val_loader))
  File "/home/asebasti/environments/conda_envs/pde_env_2/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/home/asebasti/environments/conda_envs/pde_env_2/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/home/asebasti/environments/conda_envs/pde_env_2/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/home/asebasti/environments/conda_envs/pde_env_2/lib/python3.10/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/asebasti/environments/conda_envs/pde_env_2/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/asebasti/environments/conda_envs/pde_env_2/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/asebasti/environments/conda_envs/pde_env_2/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/asebasti/PDEBench/pdebench/models/fno/utils.py", line 361, in __getitem__
    return self.data[idx,...,:self.initial_step,:], self.data[idx], self.grid
RuntimeError: CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I ran this on a machine with a Tesla K80 (CUDA version: 11.4) and on another machine with an A100 (CUDA version: 11.6), and got the same error on both.

Issue #8 seems to describe the same problem. I tried adding the generator keyword argument to the dataloaders as suggested in that thread, but I still get the same error message.

@leiterrl
Member

leiterrl commented Jun 20, 2023

Hi,

thanks for raising this issue.

Short answer: remove or comment out line 158 of train_models_forward.py, which imports the PINN training code.

Apparently, importing the DeepXDE module for PINN training changes some PyTorch multiprocessing defaults in the background, and those changes interfere with our training script.

I'll provide a fix shortly.
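
For readers who want to keep PINN support while avoiding the import-time side effect, one option is to defer the import until a PINN run is actually requested. The following is only a minimal sketch of that idea, not necessarily what the eventual fix does: the function name run_training and the parameter pinn_module_name are hypothetical, and the returned strings are placeholders for the real training calls.

```python
import importlib


def run_training(model_name, pinn_module_name="deepxde"):
    """Dispatch to a training routine by model name (hypothetical helper).

    The PINN dependency is imported only inside the PINN branch, so other
    runs (for example FNO) never execute its import-time side effects.
    """
    if model_name == "PINN":
        # Deferred import: executed only when a PINN run is requested.
        pinn_backend = importlib.import_module(pinn_module_name)
        return f"PINN training with {pinn_backend.__name__}"
    # Placeholder for the non-PINN training paths.
    return f"{model_name} training without the PINN import"
```

With this layout, calling run_training("FNO") never touches the PINN module, so whatever global defaults that module sets on import stay out of the FNO data-loading workers.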

@leiterrl
Member

This should be fixed in 331b96a.

Please re-open if your problem persists.

@amalss18
Author

This works! Thank you!
