[RLlib] Examples folder cleanup vol. 45: Enhance/redo autoregressive action distribution example. #49967
Conversation
@@ -1,223 +0,0 @@
# @OldAPIStack
old API stack -> remove
LGTM. Some general questions regarding the simplification of the `RLModule` examples. Otherwise small nits.
obs + a1. For example, if obs is -0.3 and a1 was sampled to be 1, then the value of
the first reward component is:
r1 = -abs(1.0 - [obs+a1]) = -abs(1.0 - (-0.3 + 1)) = -abs(0.3) = -0.3
The second reward component is computed as the negative absolute value
Maybe better to keep the same values as used for the first reward component.
desired_a2 = (
    self.state[0] * a1
)  # Autoregressive relationship dependent on state
# r1 depends on how well q1 is aligned with obs:
q1 -> a1 ;)
fixed
@override(TorchRLModule)
def get_inference_action_dist_cls(self):
    return TorchMultiDistribution.get_partial_dist_cls(
This is a nice example of how to use our `MultiDistribution` classes. I wonder, however, why we don't use it above, where the distributions are used for sampling? In training, I guess exactly this distribution is used for the action logps, isn't it?
It's still a very messy API:

- `from_logits` requires a single tensor, which then gets split into n sub-tensors with the help of another method 🤔
- The `ctor`, however, utilizes a nested structure of subcomponents, which should probably be the norm. I'm wondering whether we can simplify the entire `Distributions` API further by getting rid of `from_logits` altogether.
- Another, even more radical thought would be to get rid of `Distribution`/`TorchDistribution` and use `torch.Distribution` instead 😉.

But we should clean this up separately.
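For context on the flat-logits convention discussed above, here is a minimal plain-PyTorch sketch (not RLlib's `Distribution` API) of how a single flat tensor gets split into per-component sub-distributions; the sizes and names are illustrative assumptions for a Discrete(3) + Box(1,) action space:

```python
import torch
from torch.distributions import Categorical, Normal

# Assumed flat layout: 3 categorical logits, then mean and log-std of a 1-D Gaussian.
input_lens = [3, 2]

def split_and_build(flat_logits: torch.Tensor):
    # Split the single flat tensor into per-component parameter chunks.
    cat_logits, gauss_params = torch.split(flat_logits, input_lens, dim=-1)
    mean, log_std = torch.chunk(gauss_params, 2, dim=-1)
    return Categorical(logits=cat_logits), Normal(mean, log_std.exp())

flat = torch.randn(4, sum(input_lens))        # batch of 4
dist_a1, dist_a2 = split_and_build(flat)
a1, a2 = dist_a1.sample(), dist_a2.sample()   # shapes (4,) and (4, 1)
total_logp = dist_a1.log_prob(a1) + dist_a2.log_prob(a2).sum(-1)  # shape (4,)
```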
assert self.action_space[0].n == 3
assert isinstance(self.action_space[1], gym.spaces.Box)

self._prior_net = nn.Sequential(
This new definition of the autoregressive module simplifies the setup and is more readable, I admit. However, it completely leaves out how users could/should use the `RLModule` attributes as well as the `Catalog`, which makes setting up networks simple via pre-configured hyperparameters. Given that most current examples define the `RLModule` directly without leveraging its specific attributes, methods, or the `Catalog` - and recognizing that this often presents a challenge - we should evaluate our approach. Specifically, we should consider whether to (a) provide these same examples also in a form that explicitly demonstrates the use of these class attributes, methods, and the `Catalog`, or (b) streamline the `RLModule` by simplifying it further and potentially removing the `Catalog` entirely.
By adding too many APIs (`Catalog`) to this example that have nothing to do with the actual problem to be solved here (autoregressive sampling of the two action components), we would be cluttering this example.

I agree, we need to think about the usefulness of `Catalog` in general. My take has been for a while now to get rid of it and simply replace it with utility functions that the user may use, but doesn't have to. Most users don't have the problem of 1 Model -> 100 spaces (like RLlib does!), but have a relatively fixed space and thus also a relatively stable model encoder architecture that doesn't require a complex "catalog decision tree".
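For illustration, a minimal sketch (placeholder sizes, not the example's actual code) of the Catalog-free, "define the networks directly" style being discussed, with one prior head for the discrete component and one posterior head conditioned on the observation plus the one-hot first action:

```python
import torch.nn as nn

# All sizes are placeholder assumptions; in the example they would be
# derived from the observation and action spaces.
obs_dim, hidden, num_discrete = 1, 64, 3

# Prior head: obs -> logits for the first (discrete) action component.
prior_net = nn.Sequential(
    nn.Linear(obs_dim, hidden),
    nn.ReLU(),
    nn.Linear(hidden, num_discrete),
)

# Posterior head: [obs, one-hot(a1)] -> mean and log-std for the second
# (continuous) action component.
posterior_net = nn.Sequential(
    nn.Linear(obs_dim + num_discrete, hidden),
    nn.ReLU(),
    nn.Linear(hidden, 2),
)
```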
and a2 was sampled to be -0.7, then the value of the second reward component is:
r2 = -abs(obs + a1 + a2) = -abs(0.5 + 0 - 0.7) = -abs(-0.2) = -0.2

Because of this specific reward function, the agent must learn to optimally sample
Maybe describe in a sentence why this specific reward design supports the agent in learning correlated actions?
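As a rough illustration of why this reward design requires correlated actions: the second component is maximized only when a2 is close to -(obs + a1), so the best a2 depends on the already-sampled a1. A tiny sketch derived from the docstring text above (not the env implementation itself):

```python
def reward_components(obs: float, a1: int, a2: float):
    r1 = -abs(1.0 - (obs + a1))   # best when obs + a1 is close to 1.0
    r2 = -abs(obs + a1 + a2)      # best when a2 is close to -(obs + a1)
    return r1, r2

print(round(reward_components(-0.3, 1, 0.0)[0], 2))   # -0.3, the first worked example
print(round(reward_components(0.5, 0, -0.7)[1], 2))   # -0.2, the second worked example
```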
# Posterior forward pass.
posterior_batch = torch.cat(
    [obs, one_hot(a1, self.action_space[0])],
Interesting. I tried both `one-hot` and "raw" actions in multiple examples and could not see a difference. How many iterations does it need now to converge?
Yeah, I think the existing example was not learning, or only super slowly. The new example definitely learns both action components' dependency on 1) obs and 2) obs + [first action].
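For readers comparing one-hot vs. raw conditioning, here is a minimal plain-PyTorch sketch of the concatenation shown in the diff; RLlib's `one_hot` utility is replaced by `torch.nn.functional.one_hot`, and shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

# The sampled discrete action a1 is one-hot encoded and concatenated to the
# observation before the posterior forward pass.
obs = torch.randn(4, 1)                        # batch of 4, 1-D observation
a1 = torch.randint(0, 3, (4,))                 # sampled first action, Discrete(3)
posterior_batch = torch.cat([obs, F.one_hot(a1, num_classes=3).float()], dim=-1)
print(posterior_batch.shape)                   # torch.Size([4, 4])
```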
Examples folder cleanup vol. 45: Enhance/redo autoregressive action distribution example.
Why are these changes needed?
Related issue number
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.