feature(zjow): add Implicit Q-Learning #821

Merged 9 commits on Jan 27, 2025

zjowowen
Collaborator

Add the Implicit Q-Learning (IQL) algorithm.

zjowowen added the algo label ("Add new algorithm or improve old one") on Jul 29, 2024
PaParaZz1 changed the title from "feature(zjow): Add Implicit Q-Learning" to "feature(zjow): add Implicit Q-Learning" on Jul 29, 2024
),
collect=dict(data_type='d4rl', ),
eval=dict(evaluator=dict(eval_freq=5000, )),
other=dict(replay_buffer=dict(replay_buffer_size=2000000, ), ),
Member

Why is a replay buffer configured here?

config = Path(__file__).absolute().parent.parent / 'config' / args.config
config = read_config(str(config))
config[0].exp_name = config[0].exp_name.replace('0', str(args.seed))
serial_pipeline_offline(config, seed=args.seed)
Member

Why not add max_train_iter?
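
One way to act on this is sketched below using argparse only. The flag name `--max_train_iter`, its default, and the placeholder default for `--config` are assumptions for illustration and are not taken from the PR. The call to serial_pipeline_offline is left as a comment, because whether it accepts a max_train_iter keyword should be checked against the DI-engine version in use.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the entry-script parser with a bound on training iterations."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--seed', type=int, default=0)
    parser.add_argument('--config', type=str, default='')  # placeholder default
    # Hypothetical flag: upper bound on the number of training iterations.
    parser.add_argument('--max_train_iter', type=int, default=int(1e6))
    return parser


args = build_parser().parse_args(['--seed', '1', '--max_train_iter', '5000'])
# The entry script would then forward the bound, for example:
# serial_pipeline_offline(config, seed=args.seed, max_train_iter=args.max_train_iter)
```

Exposing the bound on the command line keeps the run length reproducible across seeds without editing the config file.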

@@ -114,6 +114,38 @@ def __init__(self, cfg: dict) -> None:
        except (KeyError, AttributeError):
            # do not normalize
            pass
        if hasattr(cfg.env, "reward_norm"):
            if cfg.env.reward_norm == "normalize":
                dataset['rewards'] = (dataset['rewards'] - dataset['rewards'].mean()) / dataset['rewards'].std()
Member

Add an eps to the denominator so the division stays finite when the reward std is 0.
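
The eps suggestion can be sketched as follows, assuming the rewards are a NumPy array as in the snippet under review. The helper name `normalize_rewards` and the default `eps=1e-8` are illustrative choices, not values from the PR.

```python
import numpy as np


def normalize_rewards(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Standardize rewards; eps keeps the division finite when std is 0."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# A constant-reward dataset has zero std: without eps this would be 0 / 0 = NaN.
constant = normalize_rewards(np.ones(4, dtype=np.float32))
varied = normalize_rewards(np.array([0.0, 1.0, 2.0, 3.0]))
```

With eps in place, a constant-reward dataset maps to all zeros instead of NaN, while datasets with a non-trivial spread are effectively unchanged.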

@@ -0,0 +1,654 @@
from typing import List, Dict, Any, Tuple, Union
Member

Add this policy to the table in the README.

# (str type) action_space: Use reparameterization trick for continuous action
action_space='reparameterization',
# (int) Hidden size for actor network head.
actor_head_hidden_size=512,
Member

Add more comments for each argument.

'policy_grad_norm': policy_grad_norm,
}

def _get_policy_actions(self, data: Dict, num_actions: int = 10, epsilon: float = 1e-6) -> List:
Member

Where is this method used?

# 9. update policy network
self._optimizer_policy.zero_grad()
policy_loss.backward()
policy_grad_norm = torch.nn.utils.clip_grad_norm_(self._model.actor.parameters(), 1)
Member

Make this argument configurable so that it can be set in the optimizer instead of being hard-coded to 1.
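
A minimal sketch of reading the clipping threshold from configuration rather than hard-coding it, written in plain PyTorch. The config key `grad_clip_max_norm` and the stand-in `actor` module are hypothetical and are not names from the PR or from DI-engine.

```python
import torch

# Hypothetical config entry; the key name is not taken from the PR.
learn_cfg = dict(grad_clip_max_norm=1.0)

actor = torch.nn.Linear(4, 2)  # stand-in for the actor network
optimizer = torch.optim.Adam(actor.parameters(), lr=3e-4)

loss = actor(torch.randn(8, 4)).pow(2).sum() * 100.0
optimizer.zero_grad()
loss.backward()
# clip_grad_norm_ returns the total gradient norm measured before clipping.
grad_norm = torch.nn.utils.clip_grad_norm_(actor.parameters(), learn_cfg['grad_clip_max_norm'])
# Recompute the norm after clipping to confirm it respects the threshold.
clipped_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in actor.parameters()))
optimizer.step()
```

Note that the value logged as `policy_grad_norm` in the snippet under review is the pre-clipping norm, which is what `clip_grad_norm_` returns.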

transforms=[TanhTransform(cache_size=1),
AffineTransform(loc=0.0, scale=1.05)]
)
next_action = next_obs_dist.rsample()
Member

Why rsample rather than sample here?
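
The distinction behind this question can be illustrated in plain PyTorch: `rsample` draws through the reparameterization trick, so gradients flow back to the distribution parameters, while `sample` returns a detached draw. The tensors below are stand-ins. Whether a gradient is actually needed at this point depends on how next_action is used, which the excerpt does not show.

```python
import torch
from torch.distributions import Independent, Normal

mu = torch.zeros(3, requires_grad=True)
sigma = torch.ones(3, requires_grad=True)
dist = Independent(Normal(mu, sigma), 1)

reparam = dist.rsample()   # mu + sigma * noise: differentiable w.r.t. mu and sigma
detached = dist.sample()   # drawn without autograd tracking: no path back to mu or sigma

# Only the reparameterized draw can be backpropagated through.
reparam.sum().backward()
```

If the sampled action only feeds a target value that is never differentiated, `sample` would avoid building an unnecessary graph.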

log_prob = dist.log_prob(action)

eval_data = {'obs': obs, 'action': action}
new_value = self._learn_model.forward(eval_data, mode='compute_critic')
Member

Maybe you can use with torch.no_grad() here.
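
A small sketch of the effect of this suggestion, using a stand-in linear critic rather than the PR's model. It only applies if no gradient needs to flow through new_value, which the excerpt does not confirm.

```python
import torch

critic = torch.nn.Linear(6, 1)  # stand-in for the critic head, not the PR's model
obs_action = torch.randn(5, 6)

with torch.no_grad():
    # No autograd graph is recorded, so the value acts as a constant in later losses.
    q_value = critic(obs_action)

# The same forward pass outside the context is tracked by autograd.
tracked = critic(obs_action)
```

Skipping graph construction also saves memory for values that are used purely as targets or weights.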

with torch.no_grad():
    (mu, sigma) = self._collect_model.forward(data, mode='compute_actor')['logit']
    dist = Independent(Normal(mu, sigma), 1)
    action = torch.tanh(dist.rsample())
Member

For an offline RL algorithm, you may opt to leave the collect-related methods empty.

PaParaZz1 merged commit dae7673 into opendilab:main on Jan 27, 2025
10 of 15 checks passed
Labels
algo Add new algorithm or improve old one
2 participants