Config files / running the example code #8
Comments
I'm also getting errors when trying to run the scripts in For instance, running this
from the
|
Hi there, could you try removing the |
Hi! Thanks for the comment. This seems to work for this step, but unfortunately I still run into some errors.
although the file is definitely in the correct location.
which I haven't been able to debug. Could you please share which version of PyTorch you tested the library on? Thanks! |
|
Based on our experience, this occurs when training on an RTX 3090 GPU with CUDA 11.3+.
|
Hi, thanks for the reply.
Error executing job with overrides: ['++args.base_path=/home/ubuntu/PDEBench/pdebench/data/1D/Burgers/Train/', '++args.filename=1D_Burgers_Sols_Nu0.001.hdf5', '++args.model_name=FNO']
Traceback (most recent call last):
File "/home/ubuntu/PDEBench/pdebench/models/train_models_forward.py", line 169, in main
run_training_FNO(
File "/home/ubuntu/PDEBench/pdebench/models/fno/train.py", line 194, in run_training
for xx, yy, grid in train_loader:
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
data = self._next_data()
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1376, in _next_data
return self._process_data(data)
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1402, in _process_data
data.reraise()
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/_utils.py", line 461, in reraise
raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/ubuntu/PDEBench/pdebench/models/fno/utils.py", line 360, in __getitem__
return self.data[idx,...,:self.initial_step,:], self.data[idx], self.grid
RuntimeError: CUDA error: initialization error
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
In the config file I switched to
Traceback (most recent call last):
File "/home/ubuntu/PDEBench/pdebench/models/train_models_forward.py", line 246, in <module>
main()
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/hydra/main.py", line 90, in decorated_main
_run_hydra(
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/hydra/_internal/utils.py", line 389, in _run_hydra
_run_app(
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/hydra/_internal/utils.py", line 452, in _run_app
run_and_report(
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/hydra/_internal/utils.py", line 296, in run_and_report
raise ex
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/hydra/_internal/utils.py", line 213, in run_and_report
return func()
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/hydra/_internal/utils.py", line 453, in <lambda>
lambda: hydra.run(
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "/home/ubuntu/PDEBench/pdebench/models/train_models_forward.py", line 169, in main
run_training_FNO(
File "/home/ubuntu/PDEBench/pdebench/models/fno/train.py", line 88, in run_training
train_data = FNODatasetMult(flnm,
File "/home/ubuntu/PDEBench/pdebench/models/fno/utils.py", line 387, in __init__
with h5py.File(self.file_path, 'r') as h5_file:
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/h5py/_hl/files.py", line 533, in __init__
fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
File "/opt/conda/envs/pdebench_cu116/lib/python3.9/site-packages/h5py/_hl/files.py", line 226, in make_fid
fid = h5f.open(name, flags, fapl=fapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 106, in h5py.h5f.open
FileNotFoundError: [Errno 2] Unable to open file (unable to open file: name = '/home/ubuntu/PDEBench/pdebench/data/1D/Burgers/Train/1D_Burgers_Sols_Nu0.001.hdf5.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
Note the
Here's the command I'm running and the config file.
defaults:
  - _self_
  - override hydra/hydra_logging: disabled
  - override hydra/job_logging: disabled
hydra:
  output_subdir: null
  run:
    dir: .
args:
  model_name: 'FNO'
  if_training: True
  continue_training: False
  num_workers: 2
  batch_size: 5
  initial_step: 10
  t_train: 101
  model_update: 10
  single_file: True
  reduced_resolution: 1
  reduced_resolution_t: 1
  reduced_batch: 1
  epochs: 500
  learning_rate: 1.e-3
  scheduler_step: 100
  scheduler_gamma: 0.5
  #Unet
  in_channels: 2
  out_channels: 2
  ar_mode: True
  pushforward: True
  unroll_step: 20
  #FNO
  num_channels: 2
  modes: 12
  width: 20
  #Inverse
  training_type: autoregressive
  #Inverse MCMC
  mcmc_num_samples: 20
  mcmc_warmup_steps: 10
  mcmc_num_chains: 1
  num_samples_max: 1000
  in_channels_hid: 64
  inverse_model_type: InitialConditionInterp
  #Inverse grad
  inverse_epochs: 100
  inverse_learning_rate: 0.2
  inverse_verbose_flag: False
  #Plotting
  plot: False
  channel_plot: 0 # Which channel/variable to be plotted
  x_min: -1
  x_max: 1
  y_min: -1
  y_max: 1
  t_min: 0
  t_max: 5
|
Hi there, yes, the single_file argument should be set to True for the Burgers dataset. I assume that line 184 in the fno/utils.py script would give an error since the file type is hdf5 and not h5. @mtakamoto-D Could you please provide the config arguments for the Burgers dataset and change the assert statement to also accept hdf5 files (not only h5)? |
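The doubled extension in the FileNotFoundError above (1D_Burgers_Sols_Nu0.001.hdf5.h5) suggests that a suffix was appended to a filename that already had one. Below is a minimal sketch of a path helper that tolerates both extensions. The names resolve_dataset_path, KNOWN_SUFFIXES and default_suffix are hypothetical and chosen for illustration; they are not part of PDEBench.

```python
from pathlib import Path

# Assumption: these are the only two HDF5 extensions the datasets use.
KNOWN_SUFFIXES = (".h5", ".hdf5")


def resolve_dataset_path(base_path, filename, default_suffix=".h5"):
    """Join base_path and filename, adding default_suffix only when the
    filename does not already end in a known HDF5 extension."""
    path = Path(base_path) / filename
    if path.suffix.lower() not in KNOWN_SUFFIXES:
        # Append rather than use with_suffix(): a name such as
        # "1D_Burgers_Sols_Nu0.001" has the suffix ".001", which
        # with_suffix() would replace.
        path = path.with_name(path.name + default_suffix)
    return path


with_ext = resolve_dataset_path(
    "/home/ubuntu/PDEBench/pdebench/data/1D/Burgers/Train/",
    "1D_Burgers_Sols_Nu0.001.hdf5",
)
without_ext = resolve_dataset_path(
    "/home/ubuntu/PDEBench/pdebench/data/1D/Burgers/Train/",
    "1D_Burgers_Sols_Nu0.001",
)
print(with_ext.name)
print(without_ext.name)
```

With a helper like this, the filename can be given with or without its extension and the resulting path never carries two of them.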
@timothypraditia |
@mtakamoto-D Ah I see, then I think the assert command should be fine. |
@GeoffNN: Makoto will provide the config files as soon as possible. In the meantime, could you try to copy the arguments in the file config_ReacDiff.yaml to the arguments in the file config.yaml and then try to run it? |
Hi there. |
Hello, I haven't had the bandwidth recently to try; I'll keep you posted as soon as I do. |
Hi!
Great work, it's pretty great to have dataloaders set up for all these different PDE examples.
I've been struggling with running the example code, e.g. on Advection data, which is the default one which the data_download file gets. (Btw, the downloader doesn't respect the config file data paths, which is a bit confusing for a new user.)
I started by trying
CUDA_VISIBLE_DEVICES='2' python train_models_forward.py +args=config.yaml
and got the following error:
In 'config': Could not find 'args/config.yaml'
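As far as I understand Hydra's override syntax, +args=config.yaml is read as selecting an option named config.yaml from a config group called args, which would explain the lookup of args/config.yaml. The overrides visible in the traceback elsewhere in this thread set individual keys instead (++args.base_path=..., ++args.filename=..., ++args.model_name=FNO). Below is a small sketch that builds such a command line from a dictionary. The helpers build_overrides and build_command are hypothetical and only format strings; they do not call Hydra.

```python
import shlex
import sys


def build_overrides(args):
    """Format a dict as Hydra-style '++args.key=value' overrides
    (hypothetical helper, plain string formatting only)."""
    return [f"++args.{key}={value}" for key, value in args.items()]


def build_command(script, args):
    """Return the argv list for running `script` with the given overrides."""
    return [sys.executable, script] + build_overrides(args)


# Values copied from the overrides shown in the traceback in this thread.
overrides = build_overrides(
    {
        "base_path": "/home/ubuntu/PDEBench/pdebench/data/1D/Burgers/Train/",
        "filename": "1D_Burgers_Sols_Nu0.001.hdf5",
        "model_name": "FNO",
    }
)
command = build_command("train_models_forward.py", {"model_name": "FNO"})
print(shlex.join(command))
```

Passing the list to subprocess.run avoids shell quoting issues with paths; shlex.join is used here only to print a copy-pastable form.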
My understanding is that I had to replace config.yaml with config_pinn_pde1d.yaml, but I'm not entirely sure.
Once I did so, and changed the filename and root_path in my config, I got an error "local variable 'timedomain' referenced before assignment", which seems due to how the filename is parsed.
PDEBench/pdebench/models/pinn/train.py
Line 217 in f8c8493
raise ValueError around this line; and perhaps to be more explicit in what filenames are allowed or not.
After fixing this, I got a shape error
PDEBench/pdebench/models/pinn/utils.py
Line 354 in f8c8493
val_batch_idx, the tensor self.data_output becomes 1d. I'm surprised here that the tensor isn't 3D to start with (I checked, and h5_file['tensor'] is shape (201, 1024)).
Perhaps I downloaded the wrong files? I'm using the data1D/Advection/Test/Advection_beta0.1.h5 file for now, thinking it would be the simplest.
Here is the config that I used for downloading (I didn't modify it):
Any help would be greatly welcome!!
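The "referenced before assignment" error reported above is the typical symptom of a variable that is only set inside if/elif branches, none of which matched the input. Below is a sketch of failing early with a ValueError, as the post suggests. The prefix table and the function pde_family_from_filename are hypothetical and do not reflect the actual parsing rules in pinn/train.py.

```python
# Hypothetical prefix table for illustration only; the real rules
# live in pdebench/models/pinn/train.py.
KNOWN_PREFIXES = {
    "1D_Advection": "advection",
    "1D_Burgers": "burgers",
}


def pde_family_from_filename(filename):
    """Map a dataset filename to a PDE family, raising ValueError for
    names that match no known prefix instead of leaving state unset."""
    # Keep only the last path component, accepting either separator.
    name = filename.replace("\\", "/").rsplit("/", 1)[-1]
    for prefix, family in KNOWN_PREFIXES.items():
        if name.startswith(prefix):
            return family
    allowed = ", ".join(sorted(KNOWN_PREFIXES))
    raise ValueError(
        f"Unrecognised dataset filename {name!r}; "
        f"expected a name starting with one of: {allowed}"
    )


print(pde_family_from_filename("1D_Burgers_Sols_Nu0.001.hdf5"))
```

Under these assumed prefixes, a name such as Advection_beta0.1.h5 would raise a ValueError that lists the accepted prefixes, which is easier to act on than an unassigned-variable error further down.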