I encountered an error when running `bash scripts/train_and_eval_w_geo.sh ManiGaussian_BC 0,1,2,3,4,5 5678 ${try_without_tmux}`.
I ran this code with six RTX 3090s on Ubuntu 20.04, with torch==2.0.0+cu117.
However, the following error appears during training:
Error executing job with overrides: ['method=ManiGaussian_BC', 'rlbench.task_name=ManiGaussian_BC_20240627', 'rlbench.demo_path=/home/gjf/codes/ManiGaussian/data/train_data', 'replay.path=/home/gjf/codes/ManiGaussian/replay/ManiGaussian_BC_20240627', 'framework.start_seed=0', 'framework.use_wandb=False', 'method.use_wandb=False', 'framework.wandb_group=ManiGaussian_BC_20240627', 'framework.wandb_name=ManiGaussian_BC_20240627', 'ddp.num_devices=6', 'replay.batch_size=1', 'ddp.master_port=5678', 'rlbench.tasks=[close_jar,open_drawer,sweep_to_dustpan_of_size,meat_off_grill,turn_tap,slide_block_to_color_target,put_item_in_drawer,reach_and_drag,push_buttons,stack_blocks]', 'rlbench.demos=20', 'method.neural_renderer.render_freq=2000']
Traceback (most recent call last):
File "/home/gjf/codes/ManiGaussian/train.py", line 96, in main
run_seed_fn.run_seed(
File "/home/gjf/codes/ManiGaussian/run_seed_fn.py", line 147, in run_seed
train_runner.start()
File "/home/gjf/codes/ManiGaussian/third_party/YARR/yarr/runners/offline_train_runner.py", line 200, in start
batch = self.preprocess_data(data_iter)
File "/home/gjf/codes/ManiGaussian/third_party/YARR/yarr/runners/offline_train_runner.py", line 121, in preprocess_data
sampled_batch = next(data_iter) # may raise StopIteration
File "/home/gjf/miniconda3/envs/manigaussian/lib/python3.9/site-packages/lightning/fabric/wrappers.py", line 178, in __iter__
for item in self._dataloader:
File "/home/gjf/miniconda3/envs/manigaussian/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 634, in __next__
data = self._next_data()
File "/home/gjf/miniconda3/envs/manigaussian/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 678, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/gjf/miniconda3/envs/manigaussian/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 41, in fetch
data = next(self.dataset_iter)
File "/home/gjf/codes/ManiGaussian/third_party/YARR/yarr/replay_buffer/wrappers/pytorch_replay_buffer.py", line 17, in _generator
yield self._replay_buffer.sample_transition_batch(pack_in_dict=True)
File "/home/gjf/codes/ManiGaussian/third_party/YARR/yarr/replay_buffer/uniform_replay_buffer.py", line 722, in sample_transition_batch
store = self._get_from_disk(
File "/home/gjf/codes/ManiGaussian/third_party/YARR/yarr/replay_buffer/uniform_replay_buffer.py", line 391, in _get_from_disk
d = pickle.load(f)
_pickle.UnpicklingError: invalid load key, '\x00'.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
0%|▎ | 134/100010 [03:10<35:31:44, 1.28s/it]/home/gjf/miniconda3/envs/manigaussian/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 11 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
After that, one of the GPUs stopped working, and the whole program hung at this point even after I pressed Ctrl + C. This happens every time, shortly after training starts.
By the way, I did not use tmux or wandb; could that matter?
Could you please help me with this issue?
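For reference, here is a minimal, hypothetical checker (not part of the repository) that walks a replay cache directory and reports every file that fails to unpickle, which can confirm whether a shard on disk is actually truncated. Adjust the root path to your replay.path or the shared cache directory, and note that any non-pickle files in the cache will be flagged as well:

```python
import os
import pickle
import sys

def find_corrupt_pickles(root):
    """Try to unpickle every file under `root`; return the ones that fail.

    Hypothetical helper, not part of ManiGaussian/YARR. If the cache mixes in
    non-pickle files, filter by filename/extension before calling pickle.load.
    """
    bad = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, 'rb') as f:
                    pickle.load(f)
            except Exception as exc:  # UnpicklingError, EOFError, ...
                bad.append((path, repr(exc)))
    return bad

if __name__ == '__main__':
    # e.g. python check_replay.py /tmp/arm/replay
    for path, err in find_corrupt_pickles(sys.argv[1]):
        print(f'{path}: {err}')
```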
Yes, I've also encountered this issue multiple times, but I haven't found a definitive solution yet because it seems to occur randomly. I suspect it is caused by loading a broken file that was removed by another process, since the data is cached in a shared directory ('/tmp/arm/replay' by default). I recommend trying the training again with two GPUs.
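Following that explanation, one possible mitigation is to retry a read that fails with an UnpicklingError, on the assumption that a concurrent writer has not finished flushing the shard yet. This is only a sketch, not YARR's actual code; the real change would go where the traceback points, in uniform_replay_buffer._get_from_disk:

```python
import pickle
import time

def load_pickle_with_retry(path, retries=3, delay=0.5):
    """Retry pickle.load a few times before giving up.

    Illustrative sketch only: assumes the failure is transient because another
    process is still writing the same shard in the shared cache directory.
    """
    last_err = None
    for _ in range(retries):
        try:
            with open(path, 'rb') as f:
                return pickle.load(f)
        except (pickle.UnpicklingError, EOFError) as exc:
            last_err = exc
            time.sleep(delay)  # give a concurrent writer time to finish
    raise last_err
```

A more robust alternative would be for whichever process writes the cache to write each shard to a temporary file and os.rename it into place, so readers never observe a half-written file.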