diff --git a/docs/source/_static/figures/pseucodes/pseucode-C51.png b/docs/source/_static/figures/pseucodes/pseucode-C51.png
new file mode 100644
index 000000000..ebcaccbb8
Binary files /dev/null and b/docs/source/_static/figures/pseucodes/pseucode-C51.png differ
diff --git a/docs/source/documents/api/agents/drl/c51_agent.md b/docs/source/documents/api/agents/drl/c51_agent.md
new file mode 100644
index 000000000..57f881ca6
--- /dev/null
+++ b/docs/source/documents/api/agents/drl/c51_agent.md
@@ -0,0 +1,193 @@
+# Categorical 51 DQN
+
+**Paper Link:** [**https://proceedings.mlr.press/v70/bellemare17a.html**](https://proceedings.mlr.press/v70/bellemare17a.html).
+
+The C51 algorithm (Categorical DQN) is a variant of the
+[**DQN**](./dqn_agent.md)
+that introduces a distributional approach to reinforcement learning.
+Instead of predicting a single scalar value for the Q-function (the expected return),
+C51 predicts a probability distribution over a discrete set of possible returns,
+enabling the agent to learn not just the expected return but also the uncertainty around it.
+
+This table lists some general features of the C51 algorithm:
+
+| Features of C51   | Values | Description                                                 |
+|-------------------|--------|-------------------------------------------------------------|
+| On-policy         | ❌     | The evaluation policy is the same as the target policy.     |
+| Off-policy        | ✅     | The evaluation policy is different from the target policy.  |
+| Model-free        | ✅     | No need to prepare an environment dynamics model.           |
+| Model-based       | ❌     | Needs an environment model to train the policy.             |
+| Discrete Action   | ✅     | Deals with discrete action spaces.                          |
+| Continuous Action | ❌     | Deals with continuous action spaces.                        |
+
+## Method
+
+### Problem With DQN
+
+In regular DQN, the goal is to learn the expected return for each action in a given state:
+
+$$
+Q(s, a) = \mathbb{E}[G_t | S_t=s, A_t=a],
+$$
+
+where
+
+$$
+G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum^{\infty}_{k=0} {\gamma^k R_{t+k+1}}.
+$$
+
+Instead of learning a single number $Q(s, a)$,
+C51 learns the entire distribution of possible returns for each action, denoted as $Z(s, a)$.
+
+C51 divides the possible range of returns (e.g., from -1 to 1) into a fixed set of discrete values (called atoms).
+For each atom, the algorithm predicts the probability that the return falls on that atom.
+
+The agent can now make decisions based on both the expected return and its uncertainty.
+
+### Categorical Representation
+
+To approximate $Z(s, a)$, C51 represents it as a categorical distribution over $N$ discrete atoms
+within a predefined value range $[v_{min}, v_{max}]$. The atoms are defined as:
+
+$$
+z_i = v_{min} + i \cdot \Delta, \quad \Delta = \frac{v_{max} - v_{min}}{N-1}, \quad i=0, 1, \dots, N-1.
+$$
+
+Each atom $z_i$ is associated with a probability $p_i$, forming the categorical distribution:
+
+$$
+P(Z(s,a)=z_i) = p_i, \quad \sum_{i=0}^{N-1}{p_i}=1.
+$$
+
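+In C51, a network head outputs $N$ logits per action, and a softmax over the atoms turns them into
+the probabilities $p_i$. The following is a minimal NumPy sketch (not XuanCe's implementation;
+``v_min``, ``v_max``, ``n_atoms``, and ``logits`` are illustrative names) of how the fixed support,
+the softmax probabilities, and the recovered Q-values fit together:
+
+```python3
+import numpy as np
+
+# Fixed support of N atoms in [v_min, v_max].
+v_min, v_max, n_atoms = -10.0, 10.0, 51
+delta = (v_max - v_min) / (n_atoms - 1)
+z = v_min + delta * np.arange(n_atoms)            # z_i = v_min + i * delta
+
+# Stand-in for the network output: one logit vector per action (here 4 actions).
+rng = np.random.default_rng(0)
+logits = rng.normal(size=(4, n_atoms))
+
+# Softmax over atoms gives p_i(s, a), with sum_i p_i = 1 for each action.
+probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
+probs /= probs.sum(axis=-1, keepdims=True)
+
+# The scalar Q-value is recovered as the expectation over the support,
+# so greedy action selection works exactly as in DQN.
+q_values = (probs * z).sum(axis=-1)               # Q(s, a) = sum_i z_i * p_i(s, a)
+greedy_action = int(q_values.argmax())
+```
+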
+### Distributional Bellman Equation
+
+The Bellman equation for the return distribution is given by:
+
+$$
+Z(s, a) \overset{D}{=} R + \gamma Z(S', A'),
+$$
+
+where the equality holds in distribution.
+This equation states that the distribution of returns for the current state-action pair is determined
+by the immediate reward $R$ and the discounted distribution of returns from the next state $S'$.
+
+In practice, the next-state distribution $Z(S', A')$ is projected back onto the fixed set of atoms $z_i$,
+so that the target remains a categorical distribution over the same support.
+
+## Algorithm
+
+The full algorithm for training C51 is presented in Algorithm 1.
+
+```{eval-rst}
+.. image:: ./../../../../_static/figures/pseucodes/pseucode-C51.png
+    :width: 65%
+    :align: center
+
+.. note::
+
+    Algorithm 1 computes the projection in time linear in N.
+```
+
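+As a complement to the pseudocode, here is a minimal NumPy sketch of the projection step for a
+single transition; it follows the projection in Algorithm 1, but the function name and arguments
+are illustrative and are not part of XuanCe's API:
+
+```python3
+import numpy as np
+
+def project_distribution(next_probs, reward, done, gamma=0.99,
+                         v_min=-10.0, v_max=10.0):
+    """Project r + gamma * z onto the fixed support (single transition)."""
+    n_atoms = next_probs.shape[0]
+    delta = (v_max - v_min) / (n_atoms - 1)
+    z = v_min + delta * np.arange(n_atoms)
+
+    # Apply the Bellman update to every atom and clip it into [v_min, v_max].
+    tz = np.clip(reward + (1.0 - done) * gamma * z, v_min, v_max)
+
+    # Split each probability mass between the two neighbouring atoms.
+    b = (tz - v_min) / delta                # fractional atom index in [0, N-1]
+    lower = np.floor(b).astype(int)
+    upper = np.ceil(b).astype(int)
+
+    target = np.zeros(n_atoms)
+    np.add.at(target, lower, next_probs * (upper - b))
+    np.add.at(target, upper, next_probs * (b - lower))
+    # If b lands exactly on an atom (lower == upper), both terms above are zero,
+    # so the full mass is assigned back explicitly.
+    np.add.at(target, lower, next_probs * (lower == upper))
+    return target
+
+# Example: project a uniform next-state distribution for a non-terminal transition.
+target_dist = project_distribution(np.full(51, 1.0 / 51), reward=1.0, done=0.0)
+```
+
+The projected distribution then serves as the cross-entropy target for the online network's
+output at the sampled state-action pair.
+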
+## Run C51 in XuanCe
+
+Before running C51 in XuanCe, you need to prepare a conda environment and install ``xuance`` following
+the [**installation steps**](./../../../usage/installation.rst#install-via-pypi).
+
+### Run Built-in Demos
+
+After completing the installation, you can open a Python console and run C51 directly using the following commands:
+
+```python3
+import xuance
+runner = xuance.get_runner(method='c51',
+                           env='classic_control',  # Choices: classic_control, box2d, atari.
+                           env_id='CartPole-v1',  # Choices: CartPole-v1, LunarLander-v2, ALE/Breakout-v5, etc.
+                           is_test=False)
+runner.run()  # Or runner.benchmark()
+```
+
+### Run With Self-defined Configs
+
+If you want to run C51 with different configurations, you can build a new ``.yaml`` file, e.g., ``my_config.yaml``.
+Then, run C51 with the following code block:
+
+```python3
+import xuance as xp
+runner = xp.get_runner(method='c51',
+                       env='classic_control',  # Choices: classic_control, box2d, atari.
+                       env_id='CartPole-v1',  # Choices: CartPole-v1, LunarLander-v2, ALE/Breakout-v5, etc.
+                       config_path="my_config.yaml",  # The path of the my_config.yaml file should be correct.
+                       is_test=False)
+runner.run()  # Or runner.benchmark()
+```
+
+To learn more about the configurations, please visit the
+[**tutorial of configs**](./../../configs/configuration_examples.rst).
+
+### Run With Customized Environment
+
+If you would like to run XuanCe's C51 in your own environment that was not included in XuanCe,
+you need to define the new environment following the steps in the
+[**New Environment Tutorial**](./../../../usage/new_envs.rst).
+Then, [**prepare the configuration file**](./../../../usage/new_envs.rst#step-2-create-the-config-file-and-read-the-configurations)
+``c51_myenv.yaml``.
+
+After that, you can run C51 in your own environment with the following code:
+
+```python3
+import argparse
+from xuance.common import get_configs
+from xuance.environment import REGISTRY_ENV
+from xuance.environment import make_envs
+from xuance.torch.agents import C51_Agent
+
+configs_dict = get_configs(file_dir="c51_myenv.yaml")
+configs = argparse.Namespace(**configs_dict)
+REGISTRY_ENV[configs.env_name] = MyNewEnv  # MyNewEnv is the environment class you defined in the tutorial above.
+
+envs = make_envs(configs)  # Make parallel environments.
+Agent = C51_Agent(config=configs, envs=envs)  # Create a C51 agent from XuanCe.
+Agent.train(configs.running_steps // configs.parallels)  # Train the model for the configured number of steps.
+Agent.save_model("final_train_model.pth")  # Save the model to model_dir.
+Agent.finish()  # Finish the training.
+```
+
+## Citation
+
+```{code-block} bibtex
+@inproceedings{bellemare2017distributional,
+  title={A distributional perspective on reinforcement learning},
+  author={Bellemare, Marc G and Dabney, Will and Munos, R{\'e}mi},
+  booktitle={International conference on machine learning},
+  pages={449--458},
+  year={2017},
+  organization={PMLR}
+}
+```
+
+## APIs
+
+### PyTorch
+
+```{eval-rst}
+.. automodule:: xuance.torch.agents.qlearning_family.c51_agent
+    :members:
+    :undoc-members:
+    :show-inheritance:
+```
+
+### TensorFlow2
+
+```{eval-rst}
+.. automodule:: xuance.tensorflow.agents.qlearning_family.c51_agent
+    :members:
+    :undoc-members:
+    :show-inheritance:
+```
+
+### MindSpore
+
+```{eval-rst}
+.. automodule:: xuance.mindspore.agents.qlearning_family.c51_agent
+    :members:
+    :undoc-members:
+    :show-inheritance:
+```
diff --git a/docs/source/documents/api/agents/drl/c51_agent.rst b/docs/source/documents/api/agents/drl/c51_agent.rst
deleted file mode 100644
index 35a8acd47..000000000
--- a/docs/source/documents/api/agents/drl/c51_agent.rst
+++ /dev/null
@@ -1,26 +0,0 @@
-Categorical 51 DQN
-^^^^^^^^^^^^^^^^^^^^^^
-
-PyTorch
-''''''''''''
-
-.. automodule:: xuance.torch.agents.qlearning_family.c51_agent
-    :members:
-    :undoc-members:
-    :show-inheritance:
-
-TensorFlow2
-''''''''''''
-
-.. automodule:: xuance.tensorflow.agents.qlearning_family.c51_agent
-    :members:
-    :undoc-members:
-    :show-inheritance:
-
-MindSpore
-''''''''''''
-
-.. automodule:: xuance.mindspore.agents.qlearning_family.c51_agent
-    :members:
-    :undoc-members:
-    :show-inheritance: