04‐17‐2024 Weekly Tag Up
- Joe
- Chi Hui
- Ran experiment with 50/30/20 ratios (experiment 22.5)
  - Results did not improve (maybe were even worse than 40/40/20)
    - Even after running for 20 rounds
    - Still never see the mean policy go below the constraint
- Ran experiment with 80/0/20 ratios (experiment 22.6)
  - No actions from the "queue" policy in the dataset
  - Expectation was that the mean policy would perform similarly to the speed threshold policy
  - Results were worse than the 40/40/20 ratio
    - Mean policy ended worse than where it began
- Added ability to show the offline rollouts for the dataset policies along with the current policy
  - Offline rollouts are generated by evaluating the actions produced by each policy using the value functions learned during FQE (see the sketch below)
  - Since the online returns are so bad, the assumption is that the value functions produced by FQE are far from the "true" value functions, but how do we evaluate what the "true" value functions are?
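A minimal sketch of how an offline rollout estimate can be computed from the FQE value functions; the names here (`policy.act`, `g1_q`, `start_states`) are placeholders for illustration, not the project's actual API:

```python
import numpy as np

def offline_return(policy, q_function, start_states):
    """Average FQE value of the actions the policy would take on dataset states."""
    values = []
    for state in start_states:
        action = policy.act(state)                 # action proposed by the policy
        values.append(q_function(state, action))   # value estimated during FQE
    return float(np.mean(values))

# One estimate per (policy, value function) pair, e.g.:
# R_G1_queue = offline_return(queue_policy, g1_q, start_states)
# R_G2_mean  = offline_return(mean_policy, g2_q, start_states)
```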
- Started experiment with the previous 1.0/13.89 threshold definition (experiment 23)
  - This model was originally trained in experiment 18.1 but needed to be retrained because we changed the size of the observations to support training of the "average speed limit" model
  - The understanding here is that the single-objective model performs poorly in a traffic setting (it produces gridlock) but may still be usable in a batch setting
    - To confirm this, we need to look at the mean policy actually executing in the SUMO environment after 10 rounds of training (see the sketch below)
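For reference, a minimal sketch of rolling out a policy in a live SUMO simulation via TraCI; the config path, `get_observation`, and `policy.act` are placeholder names standing in for the project's actual code:

```python
import traci

def rollout_in_sumo(policy, sumo_cfg="network.sumocfg", max_steps=3600):
    """Execute the policy in SUMO and apply its actions to the traffic lights."""
    traci.start(["sumo", "-c", sumo_cfg])
    try:
        for _ in range(max_steps):
            for tls_id in traci.trafficlight.getIDList():
                obs = get_observation(tls_id)     # placeholder: project-specific observation
                action = policy.act(obs)          # phase chosen by the mean policy
                traci.trafficlight.setPhase(tls_id, action)
            traci.simulationStep()
    finally:
        traci.close()
```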
- Review FQE
  - Seems like the value function G is producing the same value regardless of the action selected by the policy
    - All policies are very similar when comparing offline returns for G1 and G2
  - When we do offline rollouts with 100% of the dataset, the return should match the online rollout
  - Take NoCo out (so just use the raw dataset, without any constraint applied)
  - Take out random actions (only actions from the queue or speed overage policies)
  - Run lines 4-7 with the dataset policies
    - Not learning a policy, just use the dataset policies
    - Learn G1 first using speed overage, then using queue (so learn two different G1 functions)
    - Learn G2 first using speed overage, then using queue (so learn two different G2 functions)
    - Get 4 different R_G_1 values
    - Get 4 different R_G_2 values
  - Get online G1 return from the queue policy and the speed overage policy
  - Get online G2 return from the queue policy and the speed overage policy
    - Just generating single values here (not a whole plot because we're not doing it for multiple rounds); see the sketch after this list
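A rough sketch of the bookkeeping for this check, assuming hypothetical helpers `run_fqe`, `online_return`, and the `offline_return` sketched earlier; the two learning policies times two evaluated policies give the 4 R_G_1 and 4 R_G_2 values, compared against single online numbers:

```python
dataset_policies = {"speed_overage": speed_overage_policy, "queue": queue_policy}

offline = {}
for learn_name, learn_policy in dataset_policies.items():
    # FQE on the raw dataset (no NoCo constraint, no random actions) for each objective
    g1 = run_fqe(dataset, learn_policy, objective="G1")
    g2 = run_fqe(dataset, learn_policy, objective="G2")
    for eval_name, eval_policy in dataset_policies.items():
        offline[("G1", learn_name, eval_name)] = offline_return(eval_policy, g1, start_states)
        offline[("G2", learn_name, eval_name)] = offline_return(eval_policy, g2, start_states)

# Online baselines: single values per policy and objective, not per-round curves
online = {(obj, name): online_return(policy, objective=obj)
          for obj in ("G1", "G2")
          for name, policy in dataset_policies.items()}
```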
- Meet next Friday (Joe traveling during the week)