I encountered an error when running `bash scripts/train_and_eval_w_geo.sh ManiGaussian_BC 0,1,2,3,4,5 5678 ${try_without_tmux}`.
I ran this code with six RTX 3090s on Ubuntu 20.04, with torch==2.0.0+cu117.
However, the following error appears during training:
Error executing job with overrides: ['method=ManiGaussian_BC', 'rlbench.task_name=ManiGaussian_BC_20240627', 'rlbench.demo_path=/home/gjf/codes/ManiGaussian/data/train_data', 'replay.path=/home/gjf/codes/ManiGaussian/replay/ManiGaussian_BC_20240627', 'framework.start_seed=0', 'framework.use_wandb=False', 'method.use_wandb=False', 'framework.wandb_group=ManiGaussian_BC_20240627', 'framework.wandb_name=ManiGaussian_BC_20240627', 'ddp.num_devices=6', 'replay.batch_size=1', 'ddp.master_port=5678', 'rlbench.tasks=[close_jar,open_drawer,sweep_to_dustpan_of_size,meat_off_grill,turn_tap,slide_block_to_color_target,put_item_in_drawer,reach_and_drag,push_buttons,stack_blocks]', 'rlbench.demos=20', 'method.neural_renderer.render_freq=2000']
Traceback (most recent call last):
File "/home/gjf/codes/ManiGaussian/train.py", line 96, in main
run_seed_fn.run_seed(
File "/home/gjf/codes/ManiGaussian/run_seed_fn.py", line 147, in run_seed
train_runner.start()
File "/home/gjf/codes/ManiGaussian/third_party/YARR/yarr/runners/offline_train_runner.py", line 200, in start
batch = self.preprocess_data(data_iter)
File "/home/gjf/codes/ManiGaussian/third_party/YARR/yarr/runners/offline_train_runner.py", line 121, in preprocess_data
sampled_batch = next(data_iter) # may raise StopIteration
File "/home/gjf/miniconda3/envs/manigaussian/lib/python3.9/site-packages/lightning/fabric/wrappers.py", line 178, in __iter__
for item in self._dataloader:
File "/home/gjf/miniconda3/envs/manigaussian/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 634, in __next__
data = self._next_data()
File "/home/gjf/miniconda3/envs/manigaussian/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 678, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/gjf/miniconda3/envs/manigaussian/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 41, in fetch
data = next(self.dataset_iter)
File "/home/gjf/codes/ManiGaussian/third_party/YARR/yarr/replay_buffer/wrappers/pytorch_replay_buffer.py", line 17, in _generator
yield self._replay_buffer.sample_transition_batch(pack_in_dict=True)
File "/home/gjf/codes/ManiGaussian/third_party/YARR/yarr/replay_buffer/uniform_replay_buffer.py", line 722, in sample_transition_batch
store = self._get_from_disk(
File "/home/gjf/codes/ManiGaussian/third_party/YARR/yarr/replay_buffer/uniform_replay_buffer.py", line 391, in _get_from_disk
d = pickle.load(f)
_pickle.UnpicklingError: invalid load key, '\x00'.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
0%|▎ | 134/100010 [03:10<35:31:44, 1.28s/it]/home/gjf/miniconda3/envs/manigaussian/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 11 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
After that, one of the GPUs stopped working, and the whole program hung at this point even after I pressed Ctrl + C. This happens every time, shortly after training starts.
By the way, I did not use tmux or wandb; could that matter?
Could you please help me with this issue?
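For reference, here is a minimal, hypothetical checker (not part of the repository) that walks a replay cache directory and reports every file that fails to unpickle, which can confirm whether a shard on disk is actually truncated. Adjust the root path to your replay.path or the shared cache directory, and note that any non-pickle files in the cache will be flagged as well:

```python
import os
import pickle
import sys

def find_corrupt_pickles(root):
    """Try to unpickle every file under `root`; return the ones that fail.

    Hypothetical helper, not part of ManiGaussian/YARR. If the cache mixes in
    non-pickle files, filter by filename/extension before calling pickle.load.
    """
    bad = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, 'rb') as f:
                    pickle.load(f)
            except Exception as exc:  # UnpicklingError, EOFError, ...
                bad.append((path, repr(exc)))
    return bad

if __name__ == '__main__':
    # e.g. python check_replay.py /tmp/arm/replay
    for path, err in find_corrupt_pickles(sys.argv[1]):
        print(f'{path}: {err}')
```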
Yes, I've also encountered this issue multiple times, but I haven't found a definitive solution yet because it seems to occur randomly. I suspect it is caused by loading a broken file that was removed by another process, since the data is cached in a shared directory ('/tmp/arm/replay' by default). I recommend trying the training again with two GPUs.
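Following that explanation, one possible mitigation is to retry a read that fails with an UnpicklingError, on the assumption that a concurrent writer has not finished flushing the shard yet. This is only a sketch, not YARR's actual code; the real change would go where the traceback points, in uniform_replay_buffer._get_from_disk:

```python
import pickle
import time

def load_pickle_with_retry(path, retries=3, delay=0.5):
    """Retry pickle.load a few times before giving up.

    Illustrative sketch only: assumes the failure is transient because another
    process is still writing the same shard in the shared cache directory.
    """
    last_err = None
    for _ in range(retries):
        try:
            with open(path, 'rb') as f:
                return pickle.load(f)
        except (pickle.UnpicklingError, EOFError) as exc:
            last_err = exc
            time.sleep(delay)  # give a concurrent writer time to finish
    raise last_err
```

A more robust alternative would be for whichever process writes the cache to write each shard to a temporary file and os.rename it into place, so readers never observe a half-written file.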