This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Error running PPO baseline in docker #151

Open
ricard-inho opened this issue Mar 31, 2023 · 3 comments

Comments

@ricard-inho

I've been trying to run the PPO baseline inside the 2023 docker image, but I keep getting the error below and I don't know how to solve it. Does anyone have a suggestion on what to try next? Please let me know if you need any more information.

bash scripts/objectnav_train.sh
+ python -u -m torch.distributed.launch --use_env --nproc_per_node 1 run.py --exp-config configs/ddppo_objectnav_v2_hm3d_stretch.yaml --run-type train
/opt/conda/envs/habitat/lib/python3.8/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
pybullet build time: Oct 28 2022 16:15:14
2023-03-31 17:38:47,190 Initializing dataset ObjectNav-v1
pybullet build time: Oct 28 2022 16:15:14
pybullet build time: Oct 28 2022 16:15:14
pybullet build time: Oct 28 2022 16:15:14
pybullet build time: Oct 28 2022 16:15:14
2023-03-31 17:38:50,613 Initializing dataset ObjectNav-v1
2023-03-31 17:38:50,624 Initializing dataset ObjectNav-v1
2023-03-31 17:38:50,653 Initializing dataset ObjectNav-v1
2023-03-31 17:38:50,660 Initializing dataset ObjectNav-v1
2023-03-31 17:39:56,641 initializing sim Sim-v0
2023-03-31 17:39:57,010 initializing sim Sim-v0
2023-03-31 17:39:59,058 initializing sim Sim-v0
2023-03-31 17:39:59,719 initializing sim Sim-v0
2023-03-31 17:40:03,053 Initializing task ObjectNav-v2
2023-03-31 17:40:03,148 Initializing task ObjectNav-v2
/opt/conda/envs/habitat/lib/python3.8/site-packages/gym/spaces/box.py:84: UserWarning: WARN: Box bound precision lowered by casting to float32
  logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
/opt/conda/envs/habitat/lib/python3.8/site-packages/gym/spaces/box.py:84: UserWarning: WARN: Box bound precision lowered by casting to float32
  logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
2023-03-31 17:40:04,247 Initializing task ObjectNav-v2
/opt/conda/envs/habitat/lib/python3.8/site-packages/gym/spaces/box.py:84: UserWarning: WARN: Box bound precision lowered by casting to float32
  logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
2023-03-31 17:40:13,148 Initializing task ObjectNav-v2
/opt/conda/envs/habitat/lib/python3.8/site-packages/gym/spaces/box.py:84: UserWarning: WARN: Box bound precision lowered by casting to float32
  logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
2023-03-31 17:40:13,405 Center cropping observation size of depth from (640, 480) to (256, 256)
2023-03-31 17:40:13,407 Center cropping observation size of rgb from (640, 480) to (256, 256)
2023-03-31 17:40:16,805 agent number of parameters: 12593001
Traceback (most recent call last):
  File "run.py", line 96, in <module>
    main()
  File "run.py", line 50, in main
    run_exp(**vars(args))
  File "run.py", line 92, in run_exp
    execute_exp(config, run_type)
  File "run.py", line 75, in execute_exp
    trainer.train()
  File "/opt/conda/envs/habitat/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/habitat-lab/habitat-baselines/habitat_baselines/rl/ppo/ppo_trainer.py", line 829, in train
    count_steps_delta += self._collect_environment_result(
  File "/habitat-lab/habitat-baselines/habitat_baselines/rl/ppo/ppo_trainer.py", line 497, in _collect_environment_result
    outputs = [
  File "/habitat-lab/habitat-baselines/habitat_baselines/rl/ppo/ppo_trainer.py", line 498, in <listcomp>
    self.envs.wait_step_at(index_env)
  File "/opt/conda/envs/habitat/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/habitat-lab/habitat-lab/habitat/core/vector_env.py", line 410, in wait_step_at
    return self._connection_read_fns[index_env]()
  File "/habitat-lab/habitat-lab/habitat/core/vector_env.py", line 108, in __call__
    res = self.read_fn()
  File "/habitat-lab/habitat-lab/habitat/utils/pickle5_multiprocessing.py", line 68, in recv
    buf = self.recv_bytes()
  File "/opt/conda/envs/habitat/lib/python3.8/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/opt/conda/envs/habitat/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/opt/conda/envs/habitat/lib/python3.8/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
@ykarmesh
Contributor

ykarmesh commented Apr 4, 2023

Can you add export HABITAT_ENV_DEBUG=1 to your script before the python process starts? This will make the code print the full error. You can read more about it here.

Make sure you remove this command after you have solved the error.
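For example, assuming scripts/objectnav_train.sh just runs the command echoed in your log, the change would look roughly like this (your actual script may differ):

    # Temporary debugging aid -- remove once the underlying error is fixed.
    # With this set, the vectorized environments surface the real exception
    # from the worker processes instead of the bare EOFError shown above.
    export HABITAT_ENV_DEBUG=1

    python -u -m torch.distributed.launch --use_env --nproc_per_node 1 run.py \
        --exp-config configs/ddppo_objectnav_v2_hm3d_stretch.yaml --run-type train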

@ricard-inho
Author

@ykarmesh Thank you. With the debug flag set I now get the error below. Do you have any idea why this is happening?

File "/habitat-lab/habitat-lab/habitat/core/vector_env.py", line 258, in _worker_env
    return self.env.step(action)
  File "/habitat-lab/habitat-lab/habitat/gym/gym_wrapper.py", line 238, in step
        obs, reward, done, info = self.env.step(action)
  File "/opt/conda/envs/habitat/lib/python3.8/site-packages/gym/core.py", line 280, in step
    assert self.action_space.contains(o, r, done, i = self.env.step(action)
AssertionError: Unvalid action [nan nan nan nan] for action space Box(-1.0, 1.0, (4,), float32)

@ykarmesh
Contributor

ykarmesh commented Apr 24, 2023

The policy is predicting NaN actions. This usually happens when:

  1. one or more of the inputs/targets to the model are NaN, or
  2. the training has diverged.

One of the most common reasons for this issue is that the reward received from the Env is NaN. The default reward for both ObjectNav and Instance ImageNav depends on the distance-to-goal measure, so when the distance to goal becomes NaN, the reward does too. The distance to goal is computed on the navmesh generated from the robot configuration. If the robot configuration used during episode generation does not match the one used during training, you can run into this issue.
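As an illustration (this helper is not part of habitat-baselines; the distance_to_goal info key and where you would call it are assumptions), a quick check like the following, run on each step result while debugging, fails loudly before the NaN reaches the policy:

    import math

    import numpy as np

    def assert_finite_step(reward, info, observations):
        # Hypothetical debugging helper: verify that the reward and the
        # distance-to-goal measure returned by the env are finite.
        dist = info.get("distance_to_goal")
        if dist is not None and not math.isfinite(float(dist)):
            raise RuntimeError(f"distance_to_goal is {dist}")
        if not math.isfinite(float(reward)):
            raise RuntimeError(f"Non-finite reward {reward} (distance_to_goal={dist})")
        for key, value in observations.items():
            arr = np.asarray(value, dtype=np.float32)
            if not np.isfinite(arr).all():
                raise RuntimeError(f"Non-finite values in observation '{key}'")

If distance_to_goal turns out to be NaN here while the episode otherwise looks fine, that points to the navmesh/robot-configuration mismatch described above.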

There was a bug related to this issue in the configs in this repository, which I have fixed in this commit. Can you try the new config and see if the error goes away?

Note: The configs in the Habitat-Lab repository already have the correct parameters, so if you were using a config from that repository, my fix is not relevant.
