Hi isp1tze, I really like your BicNet implementation! My goal is to run it on an environment where every agent receives a reward of -1 for each time step it needs to finish the episode. But I see a problem with your actor loss: because the actor's loss is defined as the (negated) prediction of the critic, the rewards would need to be zero when the agents perform perfectly, wouldn't they?
loss_actor = -self.critic(state_batches, clear_action_batches).mean()
Can you explain to me why you implemented it this way? Also, is it possible that the reward does not converge to 0 even when the agents perform well?
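To make the concern concrete, here is a minimal sketch of what the loss value would look like under a constant -1 per-step reward. It assumes an ideal critic that outputs the true discounted return; `discounted_return`, `actor_loss`, the discount factor 0.99, and the episode lengths are all hypothetical and not taken from the repository.

```python
def discounted_return(steps, reward=-1.0, gamma=0.99):
    """Discounted return of an episode lasting `steps` steps with a constant per-step reward."""
    return sum(reward * gamma ** t for t in range(steps))


def actor_loss(q_values):
    """Negated mean of critic values, mirroring -critic(...).mean()."""
    return -sum(q_values) / len(q_values)


# Assumption: the critic is perfect and returns the true discounted return.
fast = actor_loss([discounted_return(5)])   # policy that finishes in 5 steps
slow = actor_loss([discounted_return(20)])  # policy that finishes in 20 steps

print(f"loss (5 steps):  {fast:.3f}")
print(f"loss (20 steps): {slow:.3f}")
```

In this sketch the loss is lower for the faster policy but stays above zero in both cases, so with this reward scheme the loss value would only rank policies against each other rather than approach 0.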