
[RLlib] Examples folder cleanup vol. 45: Enhance/redo autoregressive action distribution example. #49967

Conversation

@sven1977 sven1977 (Contributor) commented on Jan 20, 2025

Examples folder cleanup vol. 45: Enhance/redo autoregressive action distribution example.

  • Slightly changed the correlated-actions env to make sure a1 depends on obs and a2 depends on both obs AND a1 (an illustrative sketch of this reward structure follows below).
  • Simplified the example RLModule by removing all Catalog dependencies and the encoder.
  • Fixed a bug in the example RLModule: the discrete a1 was not one-hot encoded, which was probably the reason why learning was not working, or only working very slowly.
  • This PR also prepares for an upcoming doc PR that explains the various action distribution types available to users.
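
For orientation, here is a minimal, purely illustrative sketch of the correlated-actions reward structure described in the first bullet. It only mirrors the reward formulas quoted in the review threads further down (r1 = -abs(1.0 - (obs + a1)), r2 = -abs(obs + a1 + a2)); class name, spaces, and episode length are made up and this is not the actual example code.

```python
import gymnasium as gym
import numpy as np


class CorrelatedActionsEnvSketch(gym.Env):
    """Illustrative only: r1 rewards a1 matching obs; r2 rewards a2 matching obs AND a1."""

    def __init__(self):
        # Single float observation; tuple action: Discrete(3) first, Box second.
        self.observation_space = gym.spaces.Box(-1.0, 1.0, (1,), np.float32)
        self.action_space = gym.spaces.Tuple(
            (gym.spaces.Discrete(3), gym.spaces.Box(-2.0, 2.0, (1,), np.float32))
        )
        self._t = 0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._t = 0
        self.obs = self.observation_space.sample()
        return self.obs, {}

    def step(self, action):
        a1, a2 = action
        obs = float(self.obs[0])
        # r1: best when obs + a1 is close to 1.0 -> a1 must depend on obs.
        r1 = -abs(1.0 - (obs + a1))
        # r2: best when a2 cancels out obs + a1 -> a2 must depend on obs AND a1.
        r2 = -abs(obs + a1 + float(a2[0]))
        self._t += 1
        self.obs = self.observation_space.sample()
        return self.obs, r1 + r2, False, self._t >= 100, {}
```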

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

…_redo_remove_rllib_models_rst

Signed-off-by: sven1977 <[email protected]>

# Conflicts:
#	doc/source/rllib/rllib-rlmodule.rst
@sven1977 sven1977 added the rllib, rllib-models, rllib-docs-or-examples, rllib-newstack, and rllib-oldstack-cleanup labels on Jan 20, 2025
Signed-off-by: sven1977 <[email protected]>
@@ -1,223 +0,0 @@
# @OldAPIStack
@sven1977 sven1977 (author) commented on this diff:
old API stack -> remove

@sven1977 sven1977 added the tests-ok label (The tagger certifies test failures are unrelated and assumes personal liability.) on Jan 21, 2025
@simonsays1980 simonsays1980 (Collaborator) left a review:
LGTM. Some general questions regarding the simplification of the RLModule examples; otherwise small nits.

obs + a1. For example, if obs is -0.3 and a1 was sampled to be 1, then the value of
the first reward component is:
r1 = -abs(1.0 - [obs+a1]) = -abs(1.0 - (-0.3 + 1)) = -abs(0.3) = -0.3
The second reward component is computed as the negative absolute value
@simonsays1980 simonsays1980 (Collaborator) commented:
Maybe better to keep the same values as used for the first reward component.

desired_a2 = (
    self.state[0] * a1
)  # Autoregressive relationship dependent on state
# r1 depends on how well q1 is aligned with obs:
@simonsays1980 simonsays1980 (Collaborator) commented:
q1 -> a1 ;)

@sven1977 sven1977 (author) replied:
fixed


@override(TorchRLModule)
def get_inference_action_dist_cls(self):
    return TorchMultiDistribution.get_partial_dist_cls(
@simonsays1980 simonsays1980 (Collaborator) commented:
This is a nice example of how to use our MultiDistribution classes. I wonder, however, why we do not use it above where the distributions are used for sampling? In training I guess exactly this distribution is used for the action logps, isn't it?

@sven1977 sven1977 (author) replied:
It's still a very messy API:

  • from_logits requires a single tensor, which then gets split into n sub-tensors with the help of another method 🤔
  • The ctor, however, takes a nested structure of subcomponents, which should probably be the norm. I'm wondering whether we can simplify the entire Distributions API further by getting rid of from_logits altogether.
  • Another, even more radical thought would be to get rid of Distribution/TorchDistribution entirely and use torch.Distribution instead 😉.

But we should clean this up separately.
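
To make the two construction paths discussed above concrete, here is a small, purely illustrative torch.distributions sketch (deliberately not using RLlib's Distribution API): one flat logits tensor split into sub-tensors versus an already nested structure of component distributions. The sizes are made up for a (Discrete(3), Box(1,)) action space.

```python
import torch
from torch.distributions import Categorical, Normal

# One flat tensor per batch row: 3 categorical logits + 1 mean + 1 log-std.
flat = torch.randn(4, 5)

# "from_logits"-style: split the single tensor into sub-tensors first.
cat_logits, mean, log_std = torch.split(flat, [3, 1, 1], dim=-1)

# ctor-style: build the joint distribution from a nested structure of components.
components = {
    "a1": Categorical(logits=cat_logits),
    "a2": Normal(loc=mean, scale=log_std.exp()),
}

# Sampling and the joint action log-prob simply combine the components.
a1 = components["a1"].sample()  # shape: (4,)
a2 = components["a2"].sample()  # shape: (4, 1)
joint_logp = components["a1"].log_prob(a1) + components["a2"].log_prob(a2).sum(-1)
print(joint_logp.shape)  # torch.Size([4])
```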

assert self.action_space[0].n == 3
assert isinstance(self.action_space[1], gym.spaces.Box)

self._prior_net = nn.Sequential(
@simonsays1980 simonsays1980 (Collaborator) commented:
This new definition of the autoregressive module simplifies the setup and is more readable, I admit. However, it completely blanks out how users could/should use the RLModule attributes as well as the Catalog, which makes setting up networks easy via pre-configured hyperparameters. Given that most current examples define the RLModule directly without leveraging its specific attributes, methods, or the Catalog - and recognizing that this often presents a challenge - we should evaluate our approach. Specifically, we should consider whether to (a) provide these same examples also in a form that explicitly demonstrates the use of these class attributes, methods, and the Catalog, or (b) streamline the RLModule by simplifying it further and potentially removing the Catalog entirely.

@sven1977 sven1977 (author) replied:
By adding too many APIs to this example (Catalog) that have nothing to do with the actual problem to be solved here (autoregressive sampling of the 2 action components), we are cluttering this example.

I agree, we need to think about the usefulness of Catalog in general. My take has been for a while now to get rid of it and simply replace it with utility functions that the user may use, but doesn't have to. Most users don't have the problem of 1 model -> 100 spaces (like RLlib does!), but have a relatively fixed space and thus also a relatively stable model encoder architecture that doesn't require a complex "catalog decision tree".
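
For reference, this is roughly what the Catalog-free approach in the quoted snippet boils down to: a hypothetical, simplified setup() in which the hidden sizes, attribute names, and the observation-space access are assumptions, not the actual example code.

```python
import torch.nn as nn
from ray.rllib.core.rl_module.torch.torch_rl_module import TorchRLModule


class AutoregressiveActionsModuleSketch(TorchRLModule):
    # Illustrative only; the real example may differ in sizes and names.
    def setup(self):
        obs_dim = self.observation_space.shape[0]
        n_a1 = self.action_space[0].n  # Discrete(3) -> 3 one-hot slots.

        # Prior net: obs -> logits for the discrete a1.
        self._prior_net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, n_a1)
        )
        # Posterior net: [obs, one-hot(a1)] -> (mean, log-std) for the continuous a2.
        self._posterior_net = nn.Sequential(
            nn.Linear(obs_dim + n_a1, 256), nn.ReLU(), nn.Linear(256, 2)
        )
```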

and a2 was sampled to be -0.7, then the value of the second reward component is:
r2 = -abs(obs + a1 + a2) = -abs(0.5 + 0 - 0.7) = -abs(-0.2) = -0.2

Because of this specific reward function, the agent must learn to optimally sample
@simonsays1980 simonsays1980 (Collaborator) commented:
Maybe describe in a sentence why this specific reward design supports the agent in learning correlated actions?


# Posterior forward pass.
posterior_batch = torch.cat(
    [obs, one_hot(a1, self.action_space[0])],
@simonsays1980 simonsays1980 (Collaborator) commented:
Interesting. I tried both one-hot and "raw" actions in multiple examples and could not see a difference. How many iterations does it need now to converge?

@sven1977 sven1977 (author) replied:
Yeah, I think the existing example was not learning, or only learning super slowly. The new example definitely learns both action components' dependency on 1) obs and 2) obs + [first action].
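
For readers skimming this thread, here is a hypothetical fragment of what the one-hot fix amounts to in the two-step (prior/posterior) sampling pass. The function, its arguments, and the use of torch.nn.functional.one_hot instead of RLlib's one_hot utility are illustrative assumptions, not the PR's actual code.

```python
import torch
import torch.nn.functional as F


def autoregressive_sample(prior_net, posterior_net, obs, n_a1=3):
    """Illustrative: sample a1 from obs, then a2 from [obs, one-hot(a1)]."""
    # Prior pass: a1 depends on obs only.
    a1_logits = prior_net(obs)
    a1 = torch.distributions.Categorical(logits=a1_logits).sample()

    # Posterior pass: a2 depends on obs AND a1. Feeding the one-hot encoding of a1
    # (rather than the raw integer) is the fix discussed above.
    posterior_in = torch.cat([obs, F.one_hot(a1, n_a1).float()], dim=-1)
    a2_mean, a2_log_std = torch.chunk(posterior_net(posterior_in), 2, dim=-1)
    a2 = torch.distributions.Normal(a2_mean, a2_log_std.exp()).sample()
    return a1, a2
```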

Signed-off-by: sven1977 <[email protected]>
@sven1977 sven1977 enabled auto-merge (squash) January 21, 2025 10:14
@github-actions github-actions bot disabled auto-merge January 21, 2025 10:14
@github-actions github-actions bot added the go label (add ONLY when ready to merge, run all tests) on Jan 21, 2025
@sven1977 sven1977 enabled auto-merge (squash) January 21, 2025 11:15
Signed-off-by: sven1977 <[email protected]>
@github-actions github-actions bot disabled auto-merge January 21, 2025 11:54
@sven1977 sven1977 enabled auto-merge (squash) January 21, 2025 13:08
@sven1977 sven1977 merged commit a918028 into ray-project:master Jan 21, 2025
6 checks passed
@sven1977 sven1977 deleted the cleanup_examples_folder_45_autoregressive_actions branch January 21, 2025 13:26
simonsays1980 pushed a commit to simonsays1980/ray that referenced this pull request Jan 23, 2025
anson627 pushed a commit to anson627/ray that referenced this pull request Jan 31, 2025