Difficulty Reproducing HalfCheetah-v2 SAC Results #128
Comments
Hi, thanks for pointing this out. One possible cause for this difference is that this implementation alternates between sampling entire trajectories and taking gradient steps, whereas the original SAC paper alternates between one environment step and one gradient step. It's hard to compare the two exactly, but I'm guessing that something small like increase
Other possible causes are differences in network initialization or very minor differences in the Adam optimizer implementation (I've seen people talk about this, though I don't particularly suspect it).
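To make the scheduling difference concrete, here is a toy sketch in plain Python. It is not rlkit or softlearning code: the function names and the 1000-step loop sizes are assumptions chosen only for illustration. Both schedules take the same number of environment and gradient steps overall. What differs is how many environment steps the policy takes between consecutive updates.

```python
# Toy sketch only: no RL library is used, and all names and numbers are
# illustrative assumptions rather than either implementation's actual code.

def batch_schedule(num_loops, steps_per_loop, trains_per_loop):
    """Collect a block of environment steps, then take a block of gradient steps."""
    events = []
    for _ in range(num_loops):
        events.extend(["env"] * steps_per_loop)
        events.extend(["grad"] * trains_per_loop)
    return events


def online_schedule(total_steps):
    """Alternate one environment step with one gradient step."""
    events = []
    for _ in range(total_steps):
        events.append("env")
        events.append("grad")
    return events


def longest_env_run(events):
    """Largest number of consecutive environment steps with no gradient step between them."""
    longest = run = 0
    for event in events:
        if event == "env":
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return longest


batch = batch_schedule(num_loops=3, steps_per_loop=1000, trains_per_loop=1000)
online = online_schedule(total_steps=3000)

# Same totals, very different interleaving.
print(batch.count("env"), batch.count("grad"), longest_env_run(batch))
print(online.count("env"), online.count("grad"), longest_env_run(online))
```

In the batch schedule the policy that collects a block of data is frozen for the whole block, while in the online schedule each environment step is taken by a policy that has just been updated.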
@vitchyr Thanks very much, I will try increasing I don't see mention in the SAC paper of how the network's weights were initialized. I might look at the official implementation to see if it differs.
Thanks for trying that. My main suspicion, then, is that the difference comes from batch data collection versus interleaved data collection. If you want to investigate this, replace the evaluation path collector with a step collector and replace the batch RL algorithm with an online RL algorithm. It might take a few more edits to get it to run, but these components should be fairly plug-and-play.
Thanks again for your help @vitchyr. It looks like this issue in the softlearning repo is related: rail-berkeley/softlearning#75. However, I managed to get the same experiment running in softlearning and found that the results matched those in the paper. Running this:
I got these results on four different seeds:
These results match the paper's reported result, achieving ~15,000 mean return in the first 3M timesteps. The evaluation mean return was >15k for all runs. Note that each of these runs took 10.7 hours.
Compare this to rlkit runs with four different values of num_trains_per_train_loop:
Mean return on the first 3M timesteps ranges from 6,200 to 11,000. Due to the high values of num_trains_per_train_loop, these results also took longer to compute. The best performing one, with
rlkit has more RL algorithms implemented and is better maintained, but for now I will continue with the TensorFlow implementation since the baseline is immediately accessible. Sample and computational efficiency are important aspects of our work.
Hi, where can I see the results when I run "python3 examples/ddpg.py"? I could not find the 'output' file.
Huge thanks for providing this implementation, it's very high quality.
I'm having difficulty reproducing the results of the original SAC paper using the provided examples/sac.py script.
The paper reports a mean return of 15,000 in 3M steps (blue and orange lines are SAC):
My runs on the unmodified examples/sac.py script appear to be considerably less sample efficient:
My runs fairly consistently achieve 13,000 average return in 10M steps. They may eventually reach 15,000 average return if left to run for millions of steps further, but they need more than 3x as many steps to reach 13k mean return as the paper needs to reach 15k.
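As a quick check of that ratio, using only the step counts and returns quoted above (the variable names are my own):

```python
# Figures quoted in this issue; variable names are illustrative.
paper_steps = 3_000_000    # steps the paper needs to reach ~15,000 mean return
paper_return = 15_000
my_steps = 10_000_000      # steps my runs need to reach ~13,000 mean return
my_return = 13_000

step_ratio = my_steps / paper_steps
print(f"My runs use {step_ratio:.2f}x as many steps")

# Crude efficiency figure: mean return reached per million environment steps.
paper_per_million = paper_return / (paper_steps / 1_000_000)
my_per_million = my_return / (my_steps / 1_000_000)
print(paper_per_million, my_per_million)
```

The per-million-step figure is only a rough summary, since learning curves are not linear in the number of steps.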
I have found that results can vary greatly from run to run. Notice the pink line in my above chart that does poorly. Is the paper doing many runs and reporting the best? I didn't see this mentioned in the Experiments section of the paper.
It appears to me that the hyperparameters shown in the paper are the same as those in the script, which I have not modified:
Am I interpreting the "num total steps" and "Returns Mean" correctly? Do you know what might cause this difference in sample efficiency and final return?