I tried running python train_models_forward.py and ran into the following error:
Error executing job with overrides: []
Traceback (most recent call last):
File "/home/asebasti/PDEBench/pdebench/models/train_models_forward.py", line 250, in <module>
main()
File "/home/asebasti/environments/conda_envs/pde_env_2/lib/python3.10/site-packages/hydra/main.py", line 90, in decorated_main
_run_hydra(
File "/home/asebasti/environments/conda_envs/pde_env_2/lib/python3.10/site-packages/hydra/_internal/utils.py", line 389, in _run_hydra
_run_app(
File "/home/asebasti/environments/conda_envs/pde_env_2/lib/python3.10/site-packages/hydra/_internal/utils.py", line 452, in _run_app
run_and_report(
File "/home/asebasti/environments/conda_envs/pde_env_2/lib/python3.10/site-packages/hydra/_internal/utils.py", line 216, in run_and_report
raise ex
File "/home/asebasti/environments/conda_envs/pde_env_2/lib/python3.10/site-packages/hydra/_internal/utils.py", line 213, in run_and_report
return func()
File "/home/asebasti/environments/conda_envs/pde_env_2/lib/python3.10/site-packages/hydra/_internal/utils.py", line 453, in <lambda>
lambda: hydra.run(
File "/home/asebasti/environments/conda_envs/pde_env_2/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/home/asebasti/environments/conda_envs/pde_env_2/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/home/asebasti/environments/conda_envs/pde_env_2/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "/home/asebasti/PDEBench/pdebench/models/train_models_forward.py", line 167, in main
run_training_FNO(
File "/home/asebasti/PDEBench/pdebench/models/fno/train.py", line 111, in run_training
_, _data, _ = next(iter(val_loader))
File "/home/asebasti/environments/conda_envs/pde_env_2/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
data = self._next_data()
File "/home/asebasti/environments/conda_envs/pde_env_2/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
return self._process_data(data)
File "/home/asebasti/environments/conda_envs/pde_env_2/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
data.reraise()
File "/home/asebasti/environments/conda_envs/pde_env_2/lib/python3.10/site-packages/torch/_utils.py", line 644, in reraise
raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/asebasti/environments/conda_envs/pde_env_2/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/home/asebasti/environments/conda_envs/pde_env_2/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/asebasti/environments/conda_envs/pde_env_2/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/asebasti/PDEBench/pdebench/models/fno/utils.py", line 361, in __getitem__
return self.data[idx,...,:self.initial_step,:], self.data[idx], self.grid
RuntimeError: CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
I ran this on a machine with a Tesla K80 (CUDA version: 11.4) and on another machine with an A100 (CUDA version: 11.6), and got the same error on both.
#8 seems to have run into the same issue. I tried adding the generator keyword argument to the dataloaders as suggested in that thread, but I still get the same error message.
Short answer: remove or comment out line 158 of train_models_forward.py, which imports the PINN training code.
Apparently, importing the DeepXDE module for PINN training changes some PyTorch multiprocessing defaults in the background, and those changes interfere with our training script.
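If you would rather not edit the import out by hand, one option is to defer it until PINN is actually selected, so that an FNO run never loads DeepXDE. The following is only a minimal sketch, not how train_models_forward.py is currently written: the TRAINERS registry and resolve_trainer helper are hypothetical, and the PINN module path is an assumption (the FNO one is taken from the traceback above).

```python
import importlib

# Hypothetical registry mapping a model name to "module:function".
# The FNO entry matches the traceback; the PINN path is an assumption.
TRAINERS = {
    "FNO": "pdebench.models.fno.train:run_training",
    "PINN": "pdebench.models.pinn.train:run_training",
}


def resolve_trainer(model_name, registry=TRAINERS):
    """Import only the module that belongs to the selected model."""
    module_name, func_name = registry[model_name].split(":")
    # The import happens here, at call time, so modules for other
    # models (and their import-time side effects) are never loaded.
    module = importlib.import_module(module_name)
    return getattr(module, func_name)
```

With something like this, main() could call resolve_trainer(cfg.args.model_name) (the config field name is also an assumption) instead of importing every training function at the top of the file.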