diff --git a/week06_policy_based/a2c-optional.ipynb b/week06_policy_based/a2c-optional.ipynb index 4cb4186ad..316a4e21e 100644 --- a/week06_policy_based/a2c-optional.ipynb +++ b/week06_policy_based/a2c-optional.ipynb @@ -144,10 +144,10 @@ "To train the part of the model that predicts state values you will need to compute the value targets. \n", "Any callable could be passed to `EnvRunner` to be applied to each partial trajectory after it is collected. \n", "Thus, we can implement and use `ComputeValueTargets` callable. \n", - "The formula for the value targets is simple:\n", + "The formula for the value targets is simple, it's the right side of the following equation:\n", "\n", "$$\n", - "\\hat v(s_t) = \\left( \\sum_{t'=0}^{T - 1 - t} \\gamma^{t'}r_{t+t'} \\right) + \\gamma^T \\hat{v}(s_{t+T}),\n", + "V(s_t) = \\left( \\sum_{t'=0}^{T - 1 - t} \\gamma^{t'} \\cdot r (s_{t+t'}, a_{t + t'}) \\right) + \\gamma^T \\cdot V(s_{t+T}),\n", "$$\n", "\n", "In implementation, however, do not forget to use \n", @@ -165,7 +165,7 @@ "class ComputeValueTargets:\n", " def __init__(self, policy, gamma=0.99):\n", " self.policy = policy\n", - " \n", + "\n", " def __call__(self, trajectory):\n", " # This method should modify trajectory inplace by adding\n", " # an item with key 'value_targets' to it.\n", @@ -214,7 +214,58 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Now is the time to implement the advantage actor critic algorithm itself. You can look into your lecture,\n", + "# Actor-critic objective\n", + "\n", + "Here we define a loss function that uses rollout above to train advantage actor-critic agent.\n", + "\n", + "\n", + "Our loss consists of three components:\n", + "\n", + "* __The policy \"loss\"__\n", + " $$ \\hat J = {1 \\over T} \\cdot \\sum_t { \\log \\pi(a_t | s_t) } \\cdot A_{const}(s,a) $$\n", + " * This function has no meaning in and of itself, but it was built such that\n", + " * $ \\nabla \\hat J = {1 \\over N} \\cdot \\sum_t { \\nabla \\log \\pi(a_t | s_t) } \\cdot A(s,a) \\approx \\nabla E_{s, a \\sim \\pi} R(s,a) $\n", + " * Therefore if we __maximize__ J_hat with gradient descent we will maximize expected reward\n", + " \n", + " \n", + "* __The value \"loss\"__\n", + " $$ L_{td} = {1 \\over T} \\cdot \\sum_t { [r + \\gamma \\cdot V_{const}(s_{t+1}) - V(s_t)] ^ 2 }$$\n", + " * Ye Olde TD_loss from q-learning and alike\n", + " * If we minimize this loss, V(s) will converge to $V_\\pi(s) = E_{a \\sim \\pi(a | s)} R(s,a) $\n", + "\n", + "\n", + "* __Entropy Regularizer__\n", + " $$ H = - {1 \\over T} \\sum_t \\sum_a {\\pi(a|s_t) \\cdot \\log \\pi (a|s_t)}$$\n", + " * If we __maximize__ entropy we discourage agent from predicting zero probability to actions\n", + " prematurely (a.k.a. exploration)\n", + " \n", + " \n", + "So we optimize a linear combination of $L_{td}$ $- \\hat J$, $-H$\n", + " \n", + "```\n", + "\n", + "```\n", + "\n", + "```\n", + "\n", + "```\n", + "\n", + "```\n", + "\n", + "```\n", + "\n", + "\n", + "__One more thing:__ since we train on T-step rollouts, we can use N-step formula for advantage for free:\n", + " * At the last step, $A(s_t,a_t) = r(s_t, a_t) + \\gamma \\cdot V(s_{t+1}) - V(s) $\n", + " * One step earlier, $A(s_t,a_t) = r(s_t, a_t) + \\gamma \\cdot r(s_{t+1}, a_{t+1}) + \\gamma ^ 2 \\cdot V(s_{t+2}) - V(s) $\n", + " * Et cetera, et cetera. This way agent starts training much faster since it's estimate of A(s,a) depends less on his (imperfect) value function and more on actual rewards. 
There's also a [nice generalization](https://arxiv.org/abs/1506.02438) of this." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can also look into your lecture,\n", "[Mnih et al. 2016](https://arxiv.org/abs/1602.01783) paper, and [lecture](https://www.youtube.com/watch?v=Tol_jw5hWnI&list=PLkFD6_40KJIxJMR-j5A1mkxK26gh_qg37&index=20) by Sergey Levine." ] }, @@ -288,9 +339,22 @@ } ], "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", "name": "python", - "pygments_lexer": "ipython3" + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.3" } }, "nbformat": 4, diff --git a/week06_policy_based/atari_wrappers.py b/week06_policy_based/atari_wrappers.py index c3c45740b..4bc23f50a 100644 --- a/week06_policy_based/atari_wrappers.py +++ b/week06_policy_based/atari_wrappers.py @@ -213,12 +213,16 @@ def __init__(self, env, prefix=None, running_mean_size=100): self.episode_counter = 0 self.prefix = prefix or self.env.spec.id - nenvs = getattr(self.env.unwrapped, "nenvs", 1) - self.rewards = np.zeros(nenvs) - self.had_ended_episodes = np.zeros(nenvs, dtype=np.bool) - self.episode_lengths = np.zeros(nenvs) + self.nenvs = getattr(self.env.unwrapped, "nenvs", 1) + self.rewards = np.zeros(self.nenvs) + self.had_ended_episodes = np.zeros(self.nenvs, dtype=np.bool) + self.episode_lengths = np.zeros(self.nenvs) self.reward_queues = [deque([], maxlen=running_mean_size) - for _ in range(nenvs)] + for _ in range(self.nenvs)] + self.global_step = 0 + + def add_summary_scalar(self, name, value): + raise NotImplementedError def should_write_summaries(self): """ Returns true if it's time to write summaries. """ @@ -260,6 +264,8 @@ def step(self, action): self.reward_queues[i].append(self.rewards[i]) self.rewards[i] = 0 + self.global_step += self.nenvs + if self.should_write_summaries(): self.add_summaries() return obs, rew, done, info @@ -272,19 +278,22 @@ def reset(self, **kwargs): class TFSummaries(SummariesBase): - """ Writes env summaries using TensorFlow.""" + """ Writes env summaries using TensorFlow. + In order to write summaries in a specific directory, + you may define a writer and set it as default just before + training loop as in an example here + https://www.tensorflow.org/api_docs/python/tf/summary + Other summaries could be added in A2C class or elsewhere + """ - def __init__(self, env, prefix=None, running_mean_size=100, step_var=None): + def __init__(self, env, prefix=None, + running_mean_size=100, step_var=None): super().__init__(env, prefix, running_mean_size) - import tensorflow as tf - self.step_var = (step_var if step_var is not None - else tf.train.get_global_step()) - def add_summary_scalar(self, name, value): import tensorflow as tf - tf.contrib.summary.scalar(name, value, step = self.step_var) + tf.summary.scalar(name, value, self.global_step) class NumpySummaries(SummariesBase): @@ -304,7 +313,7 @@ def get_values(cls, name): def clear(cls): cls._summaries = defaultdict(list) - def __init__(self, env, prefix = None, running_mean_size = 100): + def __init__(self, env, prefix=None, running_mean_size=100): super().__init__(env, prefix, running_mean_size) def add_summary_scalar(self, name, value): @@ -316,6 +325,7 @@ def nature_dqn_env(env_id, nenvs=None, seed=None, """ Wraps env as in Nature DQN paper. 
""" if "NoFrameskip" not in env_id: raise ValueError(f"env_id must have 'NoFrameskip' but is {env_id}") + if nenvs is not None: if seed is None: seed = list(range(nenvs)) @@ -327,20 +337,30 @@ def nature_dqn_env(env_id, nenvs=None, seed=None, env = ParallelEnvBatch([ lambda i=i, env_seed=env_seed: nature_dqn_env( - env_id, seed=env_seed, summaries=False, clip_reward=False) + env_id, seed=env_seed, summaries=None, clip_reward=False) for i, env_seed in enumerate(seed) ]) - if summaries: - summaries_class = NumpySummaries if summaries == 'Numpy' else TFSummaries - env = summaries_class(env, prefix=env_id) + if summaries is not None: + if summaries == 'Numpy': + env = NumpySummaries(env, prefix=env_id) + elif summaries == 'TensorFlow': + env = TFSummaries(env, prefix=env_id) + else: + raise ValueError( + f"Unknown `summaries` value: expected either 'Numpy' or 'TensorFlow', got {summaries}") if clip_reward: env = ClipReward(env) return env env = gym.make(env_id) env.seed(seed) - if summaries: + if summaries == 'Numpy': + env = NumpySummaries(env) + elif summaries == 'TensorFlow': env = TFSummaries(env) + elif summaries: + raise ValueError(f"summaries must be either Numpy, " + f"or TensorFlow, or a falsy value, but is {summaries}") env = EpisodicLife(env) if "FIRE" in env.unwrapped.get_action_meanings(): env = FireReset(env) diff --git a/week06_policy_based/local_setup.sh b/week06_policy_based/local_setup.sh new file mode 100644 index 000000000..8e68e70fd --- /dev/null +++ b/week06_policy_based/local_setup.sh @@ -0,0 +1,9 @@ +#!/usr/bin/env bash + +apt-get install -yqq ffmpeg +apt-get install -yqq python-opengl + +python3 -m pip install --user gym==0.14.0 +python3 -m pip install --user pygame +python3 -m pip install --user pyglet==1.3.2 +python3 -m pip install --user tensorflow>=2.0.0 diff --git a/week06_policy_based/reinforce_tensorflow.ipynb b/week06_policy_based/reinforce_tensorflow.ipynb index 7be68a20a..8929622b2 100644 --- a/week06_policy_based/reinforce_tensorflow.ipynb +++ b/week06_policy_based/reinforce_tensorflow.ipynb @@ -11,25 +11,40 @@ "Most of the code in this notebook is taken from approximate Q-learning, so you'll find it more or less familiar and even simpler." 
] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "__Necessery dependencies:__\n", + "`ffmpeg`\n", + "`python-opengl`\n", + "`gym`\n", + "`pygame`\n", + "`pyglet`\n", + "`tensorflow==2.x`\n", + "\n", + "__Recomended dependencies:__\n", + "`gym==0.14.0`\n", + "`pyglet==1.3.2`" + ] + }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "import sys, os\n", + "import os, sys\n", + "\n", "if 'google.colab' in sys.modules:\n", - " %tensorflow_version 1.x\n", - " \n", " if not os.path.exists('.setup_complete'):\n", " !wget -q https://raw.githubusercontent.com/yandexdataschool/Practical_RL/spring20/setup_colab.sh -O- | bash\n", + " !wget -q https://raw.githubusercontent.com/yandexdataschool/Practical_RL/spring20/local_setup.sh -O- | bash\n", " !touch .setup_complete\n", - "\n", - "# This code creates a virtual display to draw game images on.\n", - "# It will have no effect if your machine has a monitor.\n", - "if type(os.environ.get(\"DISPLAY\")) is not str or len(os.environ.get(\"DISPLAY\")) == 0:\n", - " !bash ../xvfb start\n", - " os.environ['DISPLAY'] = ':1'" + "else:\n", + " pass\n", + " # If you don't have tensorflow 2.0 or gym, uncomment this and for an automatic setup (look inside before you run!)\n", + " # !./local_setup.sh" ] }, { @@ -38,17 +53,12 @@ "metadata": {}, "outputs": [], "source": [ - "import gym\n", - "import numpy as np\n", - "import matplotlib.pyplot as plt\n", - "%matplotlib inline" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "A caveat: we have received reports that the following cell may crash with `NameError: name 'base' is not defined`. The [suggested workaround](https://www.coursera.org/learn/practical-rl/discussions/all/threads/N2Pw652iEemRYQ6W2GuqHg/replies/te3HpQwOQ62tx6UMDoOt2Q/comments/o08gTqelT9KPIE6npX_S3A) is to install `gym==0.14.0` and `pyglet==1.3.2`." + "# This code creates a virtual display to draw game images on.\n", + "# It will have no effect if your machine has a monitor.\n", + "\n", + "if type(os.environ.get(\"DISPLAY\")) is not str or len(os.environ.get(\"DISPLAY\")) == 0:\n", + " !bash ../xvfb start\n", + " os.environ['DISPLAY'] = ':1'" ] }, { @@ -57,6 +67,11 @@ "metadata": {}, "outputs": [], "source": [ + "import gym\n", + "import matplotlib.pyplot as plt\n", + "%matplotlib inline\n", + "import numpy as np\n", + "\n", "env = gym.make(\"CartPole-v0\")\n", "\n", "# gym compatibility: unwrap TimeLimit\n", @@ -95,46 +110,24 @@ "source": [ "import tensorflow as tf\n", "\n", - "sess = tf.InteractiveSession()" + "model = " ] }, { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# create input variables. 
We only need for REINFORCE\n", - "ph_states = tf.placeholder('float32', (None,) + state_dim, name=\"states\")\n", - "ph_actions = tf.placeholder('int32', name=\"action_ids\")\n", - "ph_cumulative_rewards = tf.placeholder('float32', name=\"cumulative_returns\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "from keras.models import Sequential\n", - "from keras.layers import Dense\n", - "\n", - "\n", - "\n", - "logits = \n", - "\n", - "policy = tf.nn.softmax(logits)\n", - "log_policy = tf.nn.log_softmax(logits)" + "#### Predict function" ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "# Initialize model parameters\n", - "sess.run(tf.global_variables_initializer())" + "Note: output value of this function is not a tf tensor, it's a numpy array.\n", + "So, here gradient calculation is not needed. If you wrote in pytorch, you would need something like `torch.no_grad` to avoid calculation of gradients. Tensorflow doesn't compute gradients at forward pass, so no additional actions needed here." ] }, { @@ -144,12 +137,15 @@ "outputs": [], "source": [ "def predict_probs(states):\n", - " \"\"\" \n", + " \"\"\"\n", " Predict action probabilities given states.\n", " :param states: numpy array of shape [batch, state_shape]\n", " :returns: numpy array of shape [batch, n_actions]\n", " \"\"\"\n", - " return policy.eval({ph_states: [states]})[0]" + " states = \n", + " logits = model(states)\n", + " policy = \n", + " return policy" ] }, { @@ -168,7 +164,7 @@ "outputs": [], "source": [ "def generate_session(env, t_max=1000):\n", - " \"\"\" \n", + " \"\"\"\n", " Play a full session with REINFORCE agent.\n", " Returns sequences of states, actions, and rewards.\n", " \"\"\"\n", @@ -178,7 +174,7 @@ "\n", " for t in range(t_max):\n", " # action probabilities array aka pi(a|s)\n", - " action_probs = predict_probs(s)\n", + " action_probs = predict_probs(np.asarray([s]))[0]\n", "\n", " # Sample action with given probabilities.\n", " a = \n", @@ -231,9 +227,9 @@ " gamma=0.99 # discount for reward\n", " ):\n", " \"\"\"\n", - " Take a list of immediate rewards r(s,a) for the whole session \n", + " Take a list of immediate rewards r(s,a) for the whole session\n", " and compute cumulative returns (a.k.a. G(s,a) in Sutton '16).\n", - " \n", + "\n", " G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...\n", "\n", " A simple way to compute cumulative rewards is to iterate from the last\n", @@ -251,7 +247,7 @@ "metadata": {}, "outputs": [], "source": [ - "assert len(get_cumulative_rewards(range(100))) == 100\n", + "assert len(get_cumulative_rewards(list(range(100)))) == 100\n", "assert np.allclose(\n", " get_cumulative_rewards([0, 0, 1, 0, 0, 1, 0], gamma=0.9),\n", " [1.40049, 1.5561, 1.729, 0.81, 0.9, 1.0, 0.0])\n", @@ -293,22 +289,12 @@ "metadata": {}, "outputs": [], "source": [ - "# This code selects the log-probabilities (log pi(a_i|s_i)) for those actions that were actually played.\n", - "indices = tf.stack([tf.range(tf.shape(log_policy)[0]), ph_actions], axis=-1)\n", - "log_policy_for_actions = tf.gather_nd(log_policy, indices)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Policy objective as in the last formula. 
Please use reduce_mean, not reduce_sum.\n", - "# You may use log_policy_for_actions to get log probabilities for actions taken.\n", - "# Also recall that we defined ph_cumulative_rewards earlier.\n", - "\n", - "J = " + "def select_log_policy_for_actions(log_policy, actions):\n", + " # This code selects the log-probabilities (log pi(a_i|s_i))\n", + " # for those actions that were actually played.\n", + " indices = tf.stack([tf.range(tf.shape(log_policy)[0]), actions], axis=-1)\n", + " log_policy_for_actions = tf.gather_nd(log_policy, indices)\n", + " return log_policy_for_actions" ] }, { @@ -326,10 +312,7 @@ "metadata": {}, "outputs": [], "source": [ - "# Entropy regularization. If you don't add it, the policy will quickly deteriorate to\n", - "# being deterministic, harming exploration.\n", - "\n", - "entropy = " + "optimizer = " ] }, { @@ -338,39 +321,29 @@ "metadata": {}, "outputs": [], "source": [ - "# # Maximizing X is the same as minimizing -X, hence the sign.\n", - "loss = -(J + 0.1 * entropy)\n", - "\n", - "update = tf.train.AdamOptimizer().minimize(loss)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "def train_on_session(states, actions, rewards, t_max=1000):\n", + "def train_on_session(states, actions, rewards):\n", " \"\"\"given full session, trains agent with policy gradient\"\"\"\n", - " cumulative_rewards = get_cumulative_rewards(rewards)\n", - " update.run({\n", - " ph_states: states,\n", - " ph_actions: actions,\n", - " ph_cumulative_rewards: cumulative_rewards,\n", - " })\n", + " cumulative_returns = \n", + "\n", + " states = tf.keras.backend.constant(states)\n", + " cumulative_returns = tf.keras.backend.constant(cumulative_returns)\n", + " actions = tf.keras.backend.constant(actions, dtype='int32')\n", + "\n", + " with tf.GradientTape() as tape:\n", + " logits = \n", + " policy = tf.nn.softmax(logits)\n", + " log_policy = tf.nn.log_softmax(logits)\n", + " log_policy_for_actions = \n", + "\n", + " J = \n", + " entropy = \n", + " loss = -(J + 0.1 * entropy)\n", + " grads = tape.gradient(loss, model.trainable_variables)\n", + " optimizer.apply_gradients(zip(grads, model.trainable_variables))\n", + "\n", " return sum(rewards)" ] }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Initialize optimizer parameters\n", - "sess.run(tf.global_variables_initializer())" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -438,9 +411,22 @@ } ], "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", "name": "python", - "pygments_lexer": "ipython3" + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.3" } }, "nbformat": 4, diff --git a/week08_pomdp/img1_tf.jpg b/week08_pomdp/img1_tf.jpg new file mode 100644 index 000000000..4af2d9ad2 Binary files /dev/null and b/week08_pomdp/img1_tf.jpg differ diff --git a/week08_pomdp/practice_tensorflow.ipynb b/week08_pomdp/practice_tensorflow.ipynb index ffcd8eed3..36a8d1dd0 100644 --- a/week08_pomdp/practice_tensorflow.ipynb +++ b/week08_pomdp/practice_tensorflow.ipynb @@ -17,17 +17,21 @@ "metadata": {}, "outputs": [], "source": [ - "import sys, os\n", - "if 'google.colab' in sys.modules:\n", - " %tensorflow_version 1.x\n", - " \n", - " if not os.path.exists('.setup_complete'):\n", - " !wget -q 
https://raw.githubusercontent.com/yandexdataschool/Practical_RL/spring20/setup_colab.sh -O- | bash\n", - "\n", - " !wget -q https://raw.githubusercontent.com/yandexdataschool/Practical_RL/spring20/week08_pomdp/atari_util.py\n", - "\n", - " !touch .setup_complete\n", + "import os, sys\n", "\n", + "if 'google.colab' in sys.modules and not os.path.exists('.setup_complete'):\n", + " !wget -q https://raw.githubusercontent.com/yandexdataschool/Practical_RL/spring20/setup_colab.sh -O- | bash\n", + " !wget -q https://raw.githubusercontent.com/yandexdataschool/Practical_RL/spring20/week08_pomdp/atari_util.py\n", + " !wget -q https://raw.githubusercontent.com/yandexdataschool/Practical_RL/spring20/week08_pomdp/env_pool.py\n", + " !touch .setup_complete" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ "# If you are running on a server, launch xvfb to record game videos\n", "# Please make sure you have xvfb installed\n", "if type(os.environ.get(\"DISPLAY\")) is not str or len(os.environ.get(\"DISPLAY\")) == 0:\n", @@ -41,10 +45,9 @@ "metadata": {}, "outputs": [], "source": [ - "import numpy as np\n", - "\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", + "import numpy as np\n", "\n", "from IPython.display import display" ] @@ -72,13 +75,12 @@ " env = gym.make(\"KungFuMasterDeterministic-v0\")\n", " env = PreprocessAtari(\n", " env, height=42, width=42,\n", - " crop=lambda img: img[60:-30, 5:],\n", + " crop=lambda img: img[60:-30, 15:],\n", " dim_order='tensorflow',\n", " color=False, n_frames=4)\n", " return env\n", "\n", "env = make_env()\n", - "\n", "obs_shape = env.observation_space.shape\n", "n_actions = env.action_space.n\n", "\n", @@ -110,9 +112,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Simple agent for fully-observable MDP\n", + "### POMDP setting\n", "\n", - "Here's a code for an agent that only uses feedforward layers. Please read it carefully: you'll have to extend it later!" + "The atari game we're working with is actually a POMDP: your agent needs to know timing at which enemies spawn and move, but cannot do so unless it has some memory.\n", + "\n", + "Let's design another agent that has a recurrent neural net memory to solve this. 
Here's a sketch.\n", + "\n", + "![img](img1_tf.jpg)" ] }, { @@ -122,8 +128,7 @@ "outputs": [], "source": [ "import tensorflow as tf\n", - "tf.reset_default_graph()\n", - "sess = tf.InteractiveSession()" + "from tensorflow.keras.layers import Conv2D, Dense, Flatten, LSTMCell" ] }, { @@ -132,66 +137,77 @@ "metadata": {}, "outputs": [], "source": [ - "from keras.layers import Conv2D, Dense, Flatten\n", - "\n", - "\n", - "class FeedforwardAgent:\n", - " def __init__(self, name, obs_shape, n_actions, reuse=False):\n", + "class SimpleRecurrentAgent:\n", + " def __init__(self, obs_shape, n_actions):\n", " \"\"\"A simple actor-critic agent\"\"\"\n", "\n", - " with tf.variable_scope(name, reuse=reuse):\n", - " # Note: number of units/filters is arbitrary, you can and should change it at your will\n", - " self.conv0 = Conv2D(32, (3, 3), strides=(2, 2), activation='elu')\n", - " self.conv1 = Conv2D(32, (3, 3), strides=(2, 2), activation='elu')\n", - " self.conv2 = Conv2D(32, (3, 3), strides=(2, 2), activation='elu')\n", - " self.flatten = Flatten()\n", - " self.hid = Dense(128, activation='elu')\n", - " self.logits = Dense(n_actions)\n", - " self.state_value = Dense(1)\n", - "\n", - " # prepare a graph for agent step\n", - " _initial_state = self.get_initial_state(1)\n", - " self.prev_state_placeholders = [\n", - " tf.placeholder(m.dtype, [None] + [m.shape[i] for i in range(1, m.ndim)])\n", - " for m in _initial_state\n", - " ]\n", - " self.obs_t = tf.placeholder('float32', [None, ] + list(obs_shape))\n", - " self.next_state, self.agent_outputs = self.symbolic_step(self.prev_state_placeholders, self.obs_t)\n", - "\n", - " def symbolic_step(self, prev_state, obs_t):\n", - " \"\"\"Takes agent's previous step and observation, returns next state and whatever it needs to learn (tf tensors)\"\"\"\n", - "\n", - " nn = self.conv0(obs_t)\n", - " nn = self.conv1(nn)\n", - " nn = self.conv2(nn)\n", - " nn = self.flatten(nn)\n", - " nn = self.hid(nn)\n", - " logits = self.logits(nn)\n", - " state_value = self.state_value(nn)\n", - "\n", - " # feedforward agent has no state\n", - " new_state = []\n", + " # Let's define our computational graph\n", + " # Note: number of units/filters is arbitrary, you can change it at your will\n", + "\n", + " # Our backbone is denoted by the part before LSTM\n", + " self.backbone = tf.keras.Sequential((\n", + " Conv2D(32, (3, 3), strides=(2, 2), activation='elu'),\n", + " Conv2D(32, (3, 3), strides=(2, 2), activation='elu'),\n", + " Conv2D(32, (3, 3), strides=(2, 2), activation='elu'),\n", + " Flatten(),\n", + " Dense(256, activation='relu'))\n", + " )\n", + " self.lstm_output_size = 256\n", + " self.lstm_cell = LSTMCell(self.lstm_output_size)\n", + "\n", + " self.logits_head = Dense(n_actions)\n", + " self.state_value_head = Dense(1)\n", + "\n", + " @property\n", + " def trainable_variables(self):\n", + " return self.backbone.trainable_variables + \\\n", + " self.lstm_cell.trainable_variables + \\\n", + " self.logits_head.trainable_variables + \\\n", + " self.state_value_head.trainable_variables\n", + "\n", + " def __call__(self, prev_state, obs_t):\n", + " return self.forward(prev_state, obs_t)\n", + "\n", + " def forward(self, prev_state, obs_t):\n", + " \"\"\"\n", + " Takes agent's previous hidden state and a new observation,\n", + " returns a new hidden state and whatever the agent needs to learn\n", + " \"\"\"\n", + "\n", + " # Apply the whole neural net for one step here.\n", + " # See docs on self.lstm_cell(...).\n", + " # The recurrent cell should take output of the 
backbone as input.\n", + " \n", "\n", + " new_state = \n", + " logits = \n", + " state_value = \n", " return new_state, (logits, state_value)\n", "\n", + " def step(self, prev_state, obs_t):\n", + " \"\"\"Same as forward except it takes as input and returns numpy arrays\n", + " (or maybe lists of numpy arrays as input, they're allowed as well)\"\"\"\n", + " prev_state = [tf.convert_to_tensor(prev_state[0], dtype='float32'),\n", + " tf.convert_to_tensor(prev_state[1], dtype='float32')]\n", + " obs_t = tf.convert_to_tensor(obs_t, dtype='float32')\n", + " new_state, outputs = self.forward(prev_state, obs_t)\n", + " new_state = (new_state[0].numpy(), new_state[1].numpy())\n", + " outputs = (outputs[0].numpy(), outputs[1].numpy())\n", + " return new_state, outputs\n", + "\n", " def get_initial_state(self, batch_size):\n", - " \"\"\"Return a list of agent memory states at game start. Each state is a np array of shape [batch_size, ...]\"\"\"\n", + " \"\"\"Return a list of agent memory states at game start.\n", + " Each state is a tf tensor of shape [batch_size, ...]\"\"\"\n", " # feedforward agent has no state\n", - " return []\n", - "\n", - " def step(self, prev_state, obs_t):\n", - " \"\"\"Same as symbolic state except it operates on numpy arrays\"\"\"\n", - " sess = tf.get_default_session()\n", - " feed_dict = {self.obs_t: obs_t}\n", - " for state_ph, state_value in zip(self.prev_state_placeholders, prev_state):\n", - " feed_dict[state_ph] = state_value\n", - " return sess.run([self.next_state, self.agent_outputs], feed_dict)\n", + " return [tf.zeros([batch_size, self.lstm_output_size], dtype='float32'),\n", + " tf.zeros([batch_size, self.lstm_output_size], dtype='float32')]\n", "\n", " def sample_actions(self, agent_outputs):\n", - " \"\"\"pick actions given numeric agent outputs (np arrays)\"\"\"\n", + " \"\"\"pick actions given numeric agent outputs (numpy arrays)\"\"\"\n", " logits, state_values = agent_outputs\n", - " policy = np.exp(logits) / np.sum(np.exp(logits), axis=-1, keepdims=True)\n", - " return np.array([np.random.choice(len(p), p=p) for p in policy])" + " policy = tf.nn.softmax(logits, axis=-1).numpy()\n", + " return [np.random.choice(len(p), p=p) for p in policy]" ] }, { @@ -203,9 +219,7 @@ "n_parallel_games = 5\n", "gamma = 0.99\n", "\n", - "agent = FeedforwardAgent(\"agent\", obs_shape, n_actions)\n", - "\n", - "sess.run(tf.global_variables_initializer())" + "agent = SimpleRecurrentAgent(obs_shape, n_actions)" ] }, { @@ -214,8 +228,8 @@ "metadata": {}, "outputs": [], "source": [ - "state = [env.reset()]\n", - "_, (logits, value) = agent.step(agent.get_initial_state(1), state)\n", + "obs = env.reset()[np.newaxis, :].astype('float32')\n", + "_, (logits, value) = agent.step(agent.get_initial_state(1), obs)\n", "print(\"action logits:\\n\", logits)\n", "print(\"state values:\\n\", value)" ] @@ -245,8 +259,9 @@ "\n", " total_reward = 0\n", " while True:\n", - " new_memories, readouts = agent.step(\n", - " prev_memories, observation[None, ...])\n", + " observation = observation[np.newaxis, :].astype('float32')\n", + " new_memories, readouts = agent.forward(\n", + " prev_memories, observation)\n", " action = agent.sample_actions(readouts)\n", "\n", " observation, reward, done, info = env.step(action[0])\n", @@ -266,6 +281,7 @@ "metadata": {}, "outputs": [], "source": [ + "%%time\n", "import gym.wrappers\n", "\n", "with gym.wrappers.Monitor(make_env(), directory=\"videos\", force=True) as env_monitor:\n", @@ -315,12 +331,26 @@ "pool = EnvPool(agent, make_env, n_parallel_games)" ] }, 
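For reference while filling in `forward` above: the snippet below is only a standalone illustration of the `tf.keras.layers.LSTMCell` call convention, with made-up sizes, and is not part of the assignment code. The cell takes a batch of features together with the previous `[h, c]` pair and returns its output plus the new `[h, c]` pair.

```python
import tensorflow as tf

# Toy sizes, chosen only to show the call signature.
batch_size, n_features, n_units = 4, 256, 256
cell = tf.keras.layers.LSTMCell(n_units)

features = tf.zeros([batch_size, n_features])    # e.g. what a conv backbone would output
prev_state = [tf.zeros([batch_size, n_units]),   # h
              tf.zeros([batch_size, n_units])]   # c

# One recurrent step: returns the cell output and the new [h, c] list.
output, new_state = cell(features, prev_state)
print(output.shape)                              # (4, 256); the output coincides with the new h
print(new_state[0].shape, new_state[1].shape)    # (4, 256) (4, 256)
```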
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We gonna train our agent on a thing called __rollouts:__\n", + "![img](img3.jpg)\n", + "\n", + "A rollout is just a sequence of T observations, actions and rewards that agent took consequently.\n", + "* First __s0__ is not necessarily initial state for the environment\n", + "* Final state is not necessarily terminal\n", + "* We sample several parallel rollouts for efficiency" + ] + }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ + "%%time\n", "# for each of n_parallel_games, take 10 steps\n", "rollout_obs, rollout_actions, rollout_rewards, rollout_mask = pool.interact(10)" ] @@ -329,18 +359,7 @@ "cell_type": "code", "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Actions shape: (5, 10)\n", - "Rewards shape: (5, 10)\n", - "Mask shape: (5, 10)\n", - "Observations shape: (5, 10, 42, 42, 1)\n" - ] - } - ], + "outputs": [], "source": [ "print(\"Actions shape:\", rollout_actions.shape)\n", "print(\"Rewards shape:\", rollout_rewards.shape)\n", @@ -352,9 +371,54 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Actor-critic\n", + "# Actor-critic objective\n", + "\n", + "Here we define a loss function that uses rollout above to train advantage actor-critic agent.\n", + "\n", + "\n", + "Our loss consists of three components:\n", + "\n", + "* __The policy \"loss\"__\n", + " $$ \\hat J = {1 \\over T} \\cdot \\sum_t { \\log \\pi(a_t | s_t) } \\cdot A_{const}(s,a) $$\n", + " * This function has no meaning in and of itself, but it was built such that\n", + " * $ \\nabla \\hat J = {1 \\over N} \\cdot \\sum_t { \\nabla \\log \\pi(a_t | s_t) } \\cdot A(s,a) \\approx \\nabla E_{s, a \\sim \\pi} R(s,a) $\n", + " * Therefore if we __maximize__ J_hat with gradient descent we will maximize expected reward\n", + " \n", + " \n", + "* __The value \"loss\"__\n", + " $$ L_{td} = {1 \\over T} \\cdot \\sum_t { [r + \\gamma \\cdot V_{const}(s_{t+1}) - V(s_t)] ^ 2 }$$\n", + " * Ye Olde TD_loss from q-learning and alike\n", + " * If we minimize this loss, V(s) will converge to $V_\\pi(s) = E_{a \\sim \\pi(a | s)} R(s,a) $\n", + "\n", + "\n", + "* __Entropy Regularizer__\n", + " $$ H = - {1 \\over T} \\sum_t \\sum_a {\\pi(a|s_t) \\cdot \\log \\pi (a|s_t)}$$\n", + " * If we __maximize__ entropy we discourage agent from predicting zero probability to actions\n", + " prematurely (a.k.a. exploration)\n", + " \n", + " \n", + "So we optimize a linear combination of $L_{td}$ $- \\hat J$, $-H$\n", + " \n", + "```\n", + "\n", + "```\n", + "\n", + "```\n", + "\n", + "```\n", + "\n", + "```\n", + "\n", + "```\n", + "\n", + "\n", + "__One more thing:__ since we train on T-step rollouts, we can use N-step formula for advantage for free:\n", + " * At the last step, $A(s_t,a_t) = r(s_t, a_t) + \\gamma \\cdot V(s_{t+1}) - V(s) $\n", + " * One step earlier, $A(s_t,a_t) = r(s_t, a_t) + \\gamma \\cdot r(s_{t+1}, a_{t+1}) + \\gamma ^ 2 \\cdot V(s_{t+2}) - V(s) $\n", + " * Et cetera, et cetera. This way agent starts training much faster since it's estimate of A(s,a) depends less on his (imperfect) value function and more on actual rewards. There's also a [nice generalization](https://arxiv.org/abs/1506.02438) of this.\n", "\n", - "Here we define a loss function that uses rollout above to train" + "\n", + "__Note:__ it's also a good idea to scale rollout_len up to learn longer sequences. 
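To make the N-step targets above concrete, here is a tiny self-contained sketch in plain NumPy with made-up rewards and a made-up bootstrap value `V(s_T)` (episode boundaries ignored for brevity). It runs the same backward recursion `G_t = r_t + gamma * G_{t+1}` as in the week 6 REINFORCE notebook, only seeded with the critic's estimate instead of zero:

```python
import numpy as np

gamma = 0.99
rewards = np.array([1.0, 0.0, 2.0, 0.0, 1.0])  # made-up r_t for one short rollout
V_last = 3.0                                   # made-up critic estimate V(s_T)

# Backward pass over the rollout: each target mixes real rewards with a discounted bootstrap.
G = V_last
returns = []
for r in reversed(rewards):
    G = r + gamma * G
    returns.append(G)
returns = np.array(returns[::-1])

print(returns)  # value targets; advantages are these minus the critic's own V(s_t)
```

The longer the rollout, the more of every target comes from real rewards and the less from the (imperfect) bootstrap term.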
You may wish set it to >=20 or to start at 10 and then scale up as time passes." ] }, { @@ -363,25 +427,12 @@ "metadata": {}, "outputs": [], "source": [ - "observations_ph = tf.placeholder('float32', [None, None, ] + list(obs_shape))\n", - "actions_ph = tf.placeholder('int32', (None, None,))\n", - "rewards_ph = tf.placeholder('float32', (None, None,))\n", - "mask_ph = tf.placeholder('float32', (None, None,))\n", - "\n", - "initial_memory_ph = agent.prev_state_placeholders\n", - "dummy_outputs = agent.symbolic_step(\n", - " initial_memory_ph, observations_ph[:, 0])[1]\n", - "\n", - "_, outputs_seq = tf.scan(\n", - " lambda stack, obs_t: agent.symbolic_step(stack[0], obs_t),\n", - " initializer=(initial_memory_ph, dummy_outputs),\n", - " # [time, batch, h, w, c]\n", - " elems=tf.transpose(observations_ph, [1, 0, 2, 3, 4])\n", - ")\n", - "\n", - "# from [time, batch] back to [batch, time]\n", - "outputs_seq = [tf.transpose(\n", - " tensor, [1, 0] + list(range(2, tensor.shape.ndims))) for tensor in outputs_seq]" + "def select_log_policy_for_actions(log_policy, actions):\n", + " # This code selects the log-probabilities (log pi(a_i|s_i))\n", + " # for those actions that were actually played.\n", + " actions_one_hot = tf.one_hot(actions, n_actions)\n", + " log_policy_for_actions = tf.reduce_sum(log_policy * actions_one_hot, axis=-1)\n", + " return log_policy_for_actions" ] }, { @@ -390,35 +441,91 @@ "metadata": {}, "outputs": [], "source": [ - "# actor-critic losses\n", - "# logits shape: [batch, time, n_actions], states shape: [batch, time, n_actions]\n", - "logits_seq, state_values_seq = outputs_seq\n", + "def train_on_rollout(states, actions, rewards, is_not_done, prev_memory_states, gamma=0.99):\n", + " \"\"\"\n", + " Takes a sequence of states, actions and rewards produced by generate_session.\n", + " Updates agent's weights by following the policy gradient above.\n", + " Please use Adam optimizer with default parameters.\n", + " \"\"\"\n", + " # shape: [batch_size, time, c, h, w]\n", + " states = tf.convert_to_tensor(states, dtype='float32')\n", + " actions = tf.convert_to_tensor(actions, dtype='int32') # shape: [batch_size, time]\n", + " rewards = tf.convert_to_tensor(rewards, dtype='float32') # shape: [batch_size, time]\n", + " is_not_done = tf.convert_to_tensor(is_not_done, dtype='float32') # shape: [batch_size, time]\n", + " rollout_length = rewards.shape[1] - 1\n", + "\n", + " with tf.GradientTape() as tape:\n", + " # We want to stop gradient here to prevent it from travelling a long distance \n", + " memory = [tf.stop_gradient(mem_state) for mem_state in prev_memory_states]\n", + "\n", + " logits = [] # append logit sequence here\n", + " state_values = [] # append state values here\n", + " for t in range(rewards.shape[1]):\n", + " obs_t = states[:, t]\n", + "\n", + " # use agent to comute logits_t and state values_t.\n", + " # append them to logits and state_values array\n", + "\n", + " memory, (logits_t, values_t) = \n", + "\n", + " logits.append(logits_t)\n", + " state_values.append(values_t)\n", + "\n", + " logits = tf.stack(logits, axis=1)\n", + " state_values = tf.stack(state_values, axis=1)\n", + " policy = tf.nn.softmax(logits, axis=2)\n", + " log_policy = tf.nn.log_softmax(logits, axis=2)\n", + "\n", + " # select log-probabilities for chosen actions, log pi(a_i|s_i)\n", + " log_policy_for_actions = select_log_policy_for_actions(log_policy, actions)\n", + "\n", + " # Now let's compute two loss components:\n", + " # 1) Policy gradient objective.\n", + " # Notes: Please don't 
forget to call stop_gradient on advantage term. Also please use mean, not sum.\n", + " # it's okay to use loops if you want\n", + " J_hat = 0 # policy objective as in the formula for J_hat\n", + "\n", + " # 2) Temporal difference MSE for state values\n", + " # Notes: Please don't forget to call on V(s') term. Also please use mean, not sum.\n", + " # it's okay to use loops if you want\n", + " value_loss = 0\n", + "\n", + " cumulative_returns = tf.stop_gradient(state_values[:, -1]) * is_not_done[:, -1]\n", "\n", - "logprobs_seq = tf.nn.log_softmax(logits_seq)\n", - "logp_actions = tf.reduce_sum(\n", - " logprobs_seq * tf.one_hot(actions_ph, n_actions), axis=-1)[:, :-1]\n", + " for t in reversed(range(rollout_length)):\n", + " r_t = rewards[:, t] # current rewards\n", + " # current state values\n", + " V_t = state_values[:, t]\n", + " V_next = tf.stop_gradient(state_values[:, t + 1]) # next state values\n", + " # log-probability of a_t in s_t\n", + " logpi_a_s_t = log_policy_for_actions[:, t]\n", "\n", - "current_rewards = rewards_ph[:, :-1] / 100.\n", - "current_state_values = state_values_seq[:, :-1, 0]\n", - "next_state_values = state_values_seq[:, 1:, 0] * mask_ph[:, :-1]\n", + " # update G_t = r_t + gamma * G_{t+1} as we did in week6 reinforce\n", + " cumulative_returns = G_t = r_t + gamma * cumulative_returns * is_not_done[:, t]\n", "\n", - "# policy gradient\n", - "# compute 1-step advantage using current_rewards, current_state_values and next_state_values\n", - "advantage = \n", - "assert advantage.shape.ndims == 2\n", - "# compute policy entropy given logits_seq. Mind the sign!\n", - "entropy = \n", - "assert entropy.shape.ndims == 2\n", + " # Compute temporal difference error (MSE for V(s))\n", + " value_loss += \n", "\n", - "actor_loss = - tf.reduce_mean(logp_actions *\n", - " tf.stop_gradient(advantage)) - 1e-2 * tf.reduce_mean(entropy)\n", + " # compute advantage A(s_t, a_t) using cumulative returns and V(s_t) as baseline\n", + " advantage = \n", + " advantage = tf.stop_gradient(advantage)\n", "\n", - "# compute target qvalues using temporal difference\n", - "target_qvalues = \n", - "critic_loss = tf.reduce_mean(\n", - " (current_state_values - tf.stop_gradient(target_qvalues))**2)\n", + " # compute policy pseudo-loss aka -J_hat.\n", + " J_hat += \n", "\n", - "train_step = tf.train.AdamOptimizer(1e-5).minimize(actor_loss + critic_loss)" + " # regularize with entropy\n", + " entropy_reg = \n", + "\n", + " # add-up three loss components and average over time\n", + " loss = -J_hat / rollout_length +\\\n", + " value_loss / rollout_length +\\\n", + " -0.01 * entropy_reg\n", + "\n", + " # Gradient descent step\n", + " grads = tape.gradient(loss, agent.trainable_variables)\n", + " \n", + "\n", + " return loss.numpy()" ] }, { @@ -427,38 +534,30 @@ "metadata": {}, "outputs": [], "source": [ - "sess.run(tf.global_variables_initializer())" + "optimizer = " ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": null, "metadata": {}, + "outputs": [], "source": [ - "# Train \n", + "# let's test it\n", + "memory = list(pool.prev_memory_states)\n", + "rollout_obs, rollout_actions, rollout_rewards, rollout_mask = pool.interact(10)\n", "\n", - "just run train step and see if agent learns any better" + "train_on_rollout(rollout_obs, rollout_actions,\n", + " rollout_rewards, rollout_mask, memory)" ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "def sample_batch(rollout_length=10):\n", - " 
prev_mem = pool.prev_memory_states\n", - " rollout_obs, rollout_actions, rollout_rewards, rollout_mask = pool.interact(\n", - " rollout_length)\n", + "# Train \n", "\n", - " feed_dict = {\n", - " observations_ph: rollout_obs,\n", - " actions_ph: rollout_actions,\n", - " rewards_ph: rollout_rewards,\n", - " mask_ph: rollout_mask,\n", - " }\n", - " for placeholder, value in zip(initial_memory_ph, prev_mem):\n", - " feed_dict[placeholder] = value\n", - " return feed_dict" + "just run train step and see if agent learns any better" ] }, { @@ -480,42 +579,50 @@ "cell_type": "code", "execution_count": null, "metadata": {}, - "outputs": [ - { - "data": { - "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYAAAAD8CAYAAAB+UHOxAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3XlcVPe9+P/Xh2ETBRFFRAbFBaMigkDcErOYaFySmEbJ\n0iymN439fbO0vWnTLDe3adPkNm2atOm9TZs9ms3EJWqiidEYq4m4gIL7ggqyKQiKILLO5/fHOSgq\nCsowh5l5Px+PeTDzOWdm3gfxvM/y+XzeSmuNEEII7+NjdQBCCCGsIQlACCG8lCQAIYTwUpIAhBDC\nS0kCEEIILyUJQAghvJQkACGE8FKSAIQQwktJAhBCCC/la3UAF9OjRw8dExNjdRhCCOFWMjIyjmqt\nw1tar0MngJiYGNLT060OQwgh3IpSKrc168klICGE8FKSAIQQwktJAhBCCC/Voe8BNKeuro78/Hyq\nq6utDkVYLDAwELvdjp+fn9WhCOGW3C4B5OfnExwcTExMDEopq8MRFtFaU1paSn5+Pv369bM6HCHc\nkttdAqqurqZ79+6y8/dySim6d+8uZ4JCtIHbJQBAdv4CkL8DIdrKLROAEMJ7ZBdXsHLnEavD8EiS\nANxQTEwMR48etToMIVziqQXb+P8+zKC0ssbqUDyOJIA20lrjcDja7fPr6+vb7bOF6Oi2F5STnnuM\neodmUWah1eF4HEkAlyEnJ4chQ4bw8MMPk5SUxAcffMCYMWNISkoiNTWVyspKNm7cyO233w7A4sWL\n6dSpE7W1tVRXV9O/f38A3nrrLa688koSEhKYPn06VVVVADzwwAM8/vjjXH/99Tz55JOUlpYyceJE\nRowYwc9+9jO01gCcPHmSqVOnkpCQwLBhw/j000+t+YUI0U4+SMulk5+NKyKCmZeed/pvXziH23UD\nber3X+xgZ+EJp37m0N4hPHdLXIvr7dmzh/fee4/nn3+e22+/nZUrV9K5c2f+9Kc/8eqrr/LMM8+w\nZcsWANauXcuwYcPYtGkT9fX1jBo1CoDbb7+dhx56CIBnn32Wd955h8ceewyAvXv3snLlSmw2Gz//\n+c+5+uqr+e1vf8vSpUt58803Afj666/p3bs3S5cuBaC8vNypvwshrHS8qpZFmQVMT7YzJDKE/160\nnR2FJxgW1dXq0DyGnAFcpr59+zJ69GjWr1/Pzp07ueqqq0hMTGT27Nnk5ubi6+vLwIED2bVrFxs3\nbuTxxx9nzZo1rF27lnHjxgGwfft2xo0bR3x8PB999BE7duw4/fmpqanYbDYA1qxZw7333gvA1KlT\n6datGwDx8fGsXLmSJ598krVr19K1q/zHEJ7js/Q8auod3D+mL7cO742/rw+fpedZHZZHceszgNYc\nqbeXzp07A8Y9gAkTJvDJJ5+ct864ceP46quv8PPz48Ybb+SBBx6goaGBv/zlL4BxqWfRokUkJCTw\n/vvvs3r16vM+v1FzXR4HDRpERkYGy5Yt4+mnn2bixIn89re/deJWCmGNBofmg/W5jOoXxuBeIQDc\nFNeLxZmFPDNlCIF+Nosj9AxyBtBGo0eP5ocffiA7OxuAqqoq9u7dC8A111zD3/72N8aMGUN4eDil\npaXs3r2buDgjcVVUVBAZGUldXR0fffTRBb/jmmuuOb38q6++4tixYwAUFhYSFBTEvffey69//Ws2\nb97cnpsqhMus3lNMXtkpZo6NOd12R4qd8lN1rNwlXUKdxa3PADqC8PBw3n//fe6++25qaoxuai+8\n8AKDBg1i1KhRHDlyhGuuuQaA4cOH07Nnz9NH83/4wx8YNWoUffv2JT4+noqKima/47nnnuPuu+8m\nKSmJa6+9lj59+gCwbds2nnjiCXx8fPDz8+Of//ynC7ZYiPY3Oy2XXiGBTBgacbpt7IAe9O4ayLz0\nfG4e3tvC6DyH6sh31VNSUvS5BWF27drFkCFDLIpIdDTy9+B5DpRUMv6Vf/OrCYN47IbYs5a98s0e\n/u+7bNY9NZ7Irp0sirDjU0plaK1TWlpPLgEJITqUD9bn4mdT3DWyz3nLZiTb0RoWbi6wIDLPIwlA\nCNFhnKypZ356PlPjIwkPDjhved/unRnVL0zGBDiJJAAhRIfx+ZYCKmrqub/Jzd9zpaZEk1NaRXru\nMdcF5qEkAQghOgStNXPScoiP6sqI6NALrjclvhed/W18tknGBLRVqxOAUsqmlNqilPrSfN1PKbVB\nKbVPKfWpUsrfbA8wX2eby2OafMbTZvsepdRNzt4YIYT7Wn+gjL1HKrl/TN+LTvUd5O/L1OGRLN1W\nxMkamSurLS7lDOAXwK4mr/8E/FVrHQscAx402x8EjmmtBwJ/NddDKTUUuAuIAyYBryulZDSHEAKA\nOWk5dAvy45aElrt4pqZEU1XbwLJtRe0fmAdrVQJQStmBqcDb5msFjAfmm6vMBm4zn08zX2Muv8Fc\nfxowV2tdo7U+CGQDI52xEUII91Z4/BTf7DzCnVf2adUo35S+3ejXozPzMvJdEJ3nau0ZwN+A3wCN\n8x53B45rrRvPv/KBKPN5FJAHYC4vN9c/3d7Me05TSs1SSqUrpdJLSkouYVO8h7vUA3A4HLz99ttc\nffXVJCQkMGHCBL788suz1pk3bx5xcXH4+Phw7piPP/7xjwwcOJArrriC5cuXuzJ04WIfbziE1pp7\nRp3f9bM5SilmJNvZeLCMnKMn2zk6z9ViAlBK3QwUa60zmjY3s6puYdnF3nOmQes3tdYpWuuU8PDw\nlsKzVHvXAgD3rQegteaee+5h+/btLFiwgKysLN5//30+/PBDXnvttdPrDRs2jIULF54eLd1o586d
\nzJ07lx07dvD111/z8MMP09DQ4OrNEC5QU9/AJxsPccOQCKLDglr9vulJdnwUzJezgMvWmqkgrgJu\nVUpNAQKBEIwzglCllK95lG8HGqs15APRQL5SyhfoCpQ1aW/U9D2X56un4PC2Nn3EeXrFw+SXLrg4\nJyeHyZMnc/3115OWlsaiRYvYs2cPzz33HDU1NQwYMID33nuPnTt38tJLL7Fw4UIWL17MXXfdRXl5\nOQ6Hg6FDh3LgwAHeeust3nzzTWpraxk4cCAffPABQUFBPPDAA4SFhbFlyxaSkpJ45plnuPvuuykp\nKWHkyJFn1QO44447yM/Pp6Ghgf/+7//mzjvvPCve/fv388gjj1BSUkJQUBBvvfUWsbGxxMbGsn//\nfsrLywkLC2P16tVcc801jBs3jvfee48PP/yQgwcPUlRUxN69e3n11VdZv349X331FVFRUXzxxRf4\n+fnx/PPP88UXX3Dq1CnGjh3LG2+8gVKK2bNn07dvX1566czvMioqio8//pibbrqJGTNmEBUVdcFR\nvI2/s4CAAPr168fAgQPZuHEjY8aMccI/suhIlm0rovRkLTPHxFzS+3p1DWRcbDgLNufznxMGYfOR\nGtGXqsUzAK3101pru9Y6BuMm7iqt9T3Ad8AMc7WZwGLz+RLzNebyVdrYYy0B7jJ7CfUDYoGNTtsS\nF9qzZw/3338/W7ZsoXPnzrzwwgusXLmSzZs3k5KSwquvvkpSUlKz9QA2bNhwVj2ATZs2kZWVxZAh\nQ3jnnXdOf0djPYBXXnmF3//+91x99dVs2bKFW2+9lUOHDgFn6gFkZWWxfft2Jk2adF6ss2bN4n//\n93/JyMjgL3/5Cw8//DA2m41Bgwaxc+dOvv/+e5KTk1m7di01NTXk5+czcOBAwEgeS5cuZfHixdx7\n771cf/31bNu2jU6dOp2uQfDoo4+yadMmtm/fzqlTp05f4pkzZw7PPPMMJSUlTJkyhbFjx/LEE08w\nb948HnnkkRaL1xQUFBAdfeZ4wW63U1Agoz890ex1ufQP78xVA7tf8nvvSImmqLyaH7I7/iXRjqgt\nk8E9CcxVSr0AbAEa917vAB8opbIxjvzvAtBa71BKfQbsBOqBR7TWbTunv8iRentqrAUAnFUPAKC2\ntpYxY8ZcsB5AQ0PDWfUAnn32WY4fP05lZSU33XSmZ+y59QAWLlwInF8P4Ne//jVPPvkkN9988+nP\nbVRZWcm6detITU093dY4Yd24ceNYs2YNBw8e5Omnn+att97i2muv5corrzy97uTJk/Hz8yM+Pp6G\nhobTCSY+Pp6cnBwAvvvuO/785z9TVVVFWVkZcXFx3HLLLdTX1xMSEsJ//ud/MmvWLG655RZmzJhB\nXFwcw4cPZ8WKFRf9HTc3yvNiXQOFe8rKO05m3nF+f2vcZf373ji0J6FBfszLyOeaQR37knFHdEkJ\nQGu9GlhtPj9AM714tNbVQOq57eayF4EXLzXIjqbpXP0duR6Aw+EgNDSUzMzMZmP717/+RWFhIc8/\n/zwvv/zy6ctAjQICjKH4jbONNsbg4+NDfX091dXVPPzww6SnpxMdHc3vfvc7qqurAU4nr927d/PH\nP/4Rm83GxIkTASguLqZnz54X+Q0bR/x5eWf6DOTn59O7t8wA6WnmpOXS2d/G7Unn9QdplQBfG9MS\nevPJpjzKq+roGuTn5Ag9m4wEbqOOXA8gJCSEfv36MW/ePMBIVllZWQCMGjWKdevW4ePjQ2BgIImJ\nibzxxhvnnUVcTOPOvkePHlRWVjJ//vyzlldUVHDFFVfwzTff4HA4WLFiBdXV1bzyyivn3as41623\n3srcuXOpqanh4MGD7Nu3j5EjpdewJymtrOGLrYVMT7YTHHj5O+7UlGhq6x0syZJLhJdKEkAbNa0H\nMHz4cEaPHs3u3bsBmq0HMHz48PPqAUyYMIHBgwdf8Duee+451qxZQ1JSEt98881Z9QBGjhxJYmIi\nL774Is8+++x57/3oo4945513SEhIIC4ujsWLjVs1AQEBREdHn76UNW7cOCoqKoiPj2/1toeGhvLQ\nQw8RHx/Pbbfddtblo7vvvpvf/va3PP3007z++utcffXVxMbGMnfuXB555JHT2/v5559jt9tJS0tj\n6tSppy+DxcXFcccddzB06FAmTZrEP/7xj9NnFcIzfJqeR61Z8rEt4nqHMCQyRMYEXAapByDahcPh\nYPr06SQmJvL4448THBxMSUkJCxcu5MEHH8TX1zm1iOTvwT3VNzi49uXVxPQI4qOfjm7z5737/UGe\n/3InX/9y3OkSkt5M6gEIS/n4+DB//nzCwsK46aabSEpK4ic/+QmxsbFO2/kL9/Xt7mIKjp/i/kvs\n+nkht42Iws+mmJcuZwGXwi3/J2qtpUeIG7DZbDz22GM89thj7fL5HfnsVVzcnLQcokI7ccPgi3cG\naK2wzv7cMDiCRVsKeGryYPxscmzbGm73WwoMDKS0tFT+83s5rTWlpaUEBgZaHYq4RNnFFfyQXco9\no/vg68QddWqKndKTtazaXey0z/R0bncGYLfbyc/PR+YJEoGBgdjtdqvDEJdo9rpc/H19uDMluuWV\nL8G1g8IJDw5gXno+N8X1cupneyq3SwB+fn7069fP6jCEEJfhRHUdCzbnc8vw3nTvcn7Jx7bwtflw\ne1IUb689SHFFNT2D5eywJW53CUgI4b4WZuRTVdvAAxcp+dgWqcnRNDg0i7bImIDWkAQghHAJh0Mz\nJy2XEX1Cibd3bZfvGNizCyP6hDIvPV/uE7aCJAAhhEv8sP8oB46evORZPy/VHSnR7CuuJCu/vF2/\nxxNIAhBCuMTsdbn06OLP5Pj2vUF78/BIAv18mJcuReNbIglACNHu8sqq+Hb3Ee4e2YcA3/ad0iM4\n0I/JwyJZklVIdZ0UEboYSQBCiHb34YZcfJTix60s+dhWqcl2KqrrWb7jsEu+z11JAhBCtKvqugY+\n3ZTHTXERRHbt5JLvHN2/O/ZunWRqiBZIAhBCtKslWYUcr6pz2rw/reHjYxSN/2H/UfKPVbnse92N\nJAAhRLvRWjN7XQ5XRAQzql+YS797epIdrWFBhowJuJAWE4BSKlAptVEplaWU2qGU+r3Z/r5S6qBS\nKtN8JJrtSin1d6VUtlJqq1IqqclnzVRK7TMfMy/0nUIIz7D50HF2FJ7g/rF9XT6BY3RYEGMHdGf+\n5jwcDhkT0JzWnAHUAOO11glAIjBJKdU4gfcTWutE89FYd3AyRsH3WGAW8E8ApVQY8BwwCqOU5HNK\nqW7O2xQhREczJy2H4EBfbku8vJKPbXVHSjR5ZafYcLDMku/v6FpMANpQab70Mx8XS6fTgDnm+9YD\noUqpSOAmYIXWukxrfQxYAUxqW/hCiI6quKKaZduKSE2OpnOANdOO3RTXi+AAX+ZlyJiA5rTqHoBS\nyqaUygSKMXbiG8xFL5qXef6qlGqc2SkKaPrbzjfbLtQ
uhPBAczfmUdegua+NJR/bopO/jZsTerNs\nWxEV1XWWxdFRtSoBaK0btNaJgB0YqZQaBjwNDAauBMKAJ83Vm7vQpy/Sfhal1CylVLpSKl2mfBbC\nPdU1OPhoQy7XDgqnX4/OlsaSmmKnus7B0q1FlsbREV1SLyCt9XFgNTBJa11kXuapAd7DuK4PxpF9\n04m+7UDhRdrP/Y43tdYpWuuU8PDwSwlPCNFBfLPjCEdO1DBzrHVH/41GRIcyILyzFI1vRmt6AYUr\npULN552AG4Hd5nV9lHFr/zZgu/mWJcD9Zm+g0UC51roIWA5MVEp1M2/+TjTbhBAeZnZaDn3Cgrh2\nkHNKPraFUoo7UqLJyD3G/pLKlt/gRVpzBhAJfKeU2gpswrgH8CXwkVJqG7AN6AG8YK6/DDgAZANv\nAQ8DaK3LgD+Yn7EJeN5sE0J4kF1FJ9h4sIz7RvfF5tMxanf/KCkKm49ivpwFnKXFW/Na663AiGba\nx19gfQ08coFl7wLvXmKMQgg3Mictl0A/H1JTOk65zp7BgVw3KJwFGfn8asIgp9YidmfyWxBCOE15\nVR2LthRwW2IUoUH+VodzltQUO8UVNazdd9TqUDoMSQBCCKeZl5HHqboGS7t+Xsj4wRGEdfaXMQFN\nSAIQQjiFw6H5YH0uV8Z0I653+5R8bAt/Xx9uS4xixc4jlJ2stTqcDkESgBDCKf69r4Tc0iqXzvp5\nqVJT7NQ1aBZnygRxIAlACOEkc9bl0DM4gJvi2rfkY1sMiQxhWFSI1AkwSQIQQrRZztGTrN5bwo9H\n9cHft2PvVu5IiWZn0Ql2FErR+I79LyWEcAsfrs/FphQ/Humako9tcWtCb/xtPnIWgCQAIUQbVdXW\n81l6HpPjI+kZEmh1OC0KDfJnQlwEizILqKn37qLxkgCEEG2yOLOQE9X1zOyAXT8vJDXZzvGqOr7d\nVWx1KJaSBCCEuGyNJR+HRoaQ3Nd96juNiw2nV0gg89K9e0yAJAAhxGXbeLCM3YcrmGlByce2sPko\npidH8e+9JRw5UW11OJaRBCCEuGxz0nLp2smPWxPcr7bTjORoHBoWbvbeMQGSAIQQl+VweTVf7zjM\nnVdG08nfZnU4l6xfj85cGdONeel5GHNYeh9JAEKIy/LxhlwcWnPvKPe5+Xuu1ORoDhw9yeZDx6wO\nxRKSAIQQl6ymvoGPNx7ihsE96dM9yOpwLtuU4ZEE+du8dkyAJAAhxCX7evthjlbWduh5f1qjS4Av\nU+Ij+XJrEVW19VaH43KSAIQQl2z2uhz69+jM1QN7WB1Km6Um26msqeerbYetDsXlJAEIIS7Jtvxy\nNh86zn1j+uLTQUo+tsXIfmH07R7klXUCWlMUPlAptVEplaWU2qGU+r3Z3k8ptUEptU8p9alSyt9s\nDzBfZ5vLY5p81tNm+x6l1E3ttVFCiPYzJy2HIH8b05M7TsnHtlBKkZpsZ/2BMg6VVlkdjku15gyg\nBhivtU4AEoFJSqnRwJ+Av2qtY4FjwIPm+g8Cx7TWA4G/muuhlBoK3AXEAZOA15VS7td3TAgvduxk\nLYuzCrk9KYqQQD+rw3Ga25PsKAXzN3vXzeAWE4A2VJov/cyHBsYD88322cBt5vNp5mvM5TcoY4jg\nNGCu1rpGa30QyAZGOmUrhBAu8Wl6HrX1Dre/+Xuu3qGduHpgDxZk5ONweM+YgFbdA1BK2ZRSmUAx\nsALYDxzXWjfeNs8HGocCRgF5AObycqB70/Zm3iOE6OAaHJoP0nIZ0787gyKCrQ7H6VJToik4fop1\n+0utDsVlWpUAtNYNWutEwI5x1D6kudXMn83dFdIXaT+LUmqWUipdKZVeUlLSmvCEEC6wancxBcdP\nMXOs+w78upiJQyMICfT1qpvBl9QLSGt9HFgNjAZClVK+5iI7UGg+zweiAczlXYGypu3NvKfpd7yp\ntU7RWqeEh4dfSnhCiHY0Jy2HyK6B3DgkwupQ2kWgn41piVF8vf0w5afqrA7HJVrTCyhcKRVqPu8E\n3AjsAr4DZpirzQQWm8+XmK8xl6/SxkQbS4C7zF5C/YBYYKOzNkQI0X6yiytZu+8o947ui6/Nc3uP\np6bYqal38OXW845NPVJr/iUjge+UUluBTcAKrfWXwJPA40qpbIxr/O+Y678DdDfbHweeAtBa7wA+\nA3YCXwOPaK29uxyPEG7iw/W5+Nt8uPPK6JZXdmPxUV25IiKYz7xkagjfllbQWm8FRjTTfoBmevFo\nrauB1At81ovAi5cephDCKpU19czPyOfm4ZH06BJgdTjtSilFaoqdF5buYt+RCmI98GZ3U557LieE\ncIrPN+dTWVPP/WNjrA7FJW4bEYWvj2JehuefBUgCEEJckNaa2Wm5JNi7khgdanU4LtGjSwDjB/dk\n4eYC6hocVofTriQBCCEuKG1/KdnFlR438KslqSnRHK2sYfUez+6KLglACHFBs9NyCOvsz9ThkVaH\n4lLXXRFOjy7+Hl80XhKAEKJZBcdPsWLnEe66MppAP++atsvP5sPtSXZW7S7maGWN1eG0G0kAQohm\nfbQ+F4B7RnvmyN+WpCbbqXdoFm3x3KLxkgCEEOeprmtg7qY8JgyNICq0k9XhWCI2IpiE6FDmped7\nbNF4SQBCiPMs3VpE2claZnrZzd9zpSbb2XOkgm0F5VaH0i4kAQghzjMnLYeBPbswZkB3q0Ox1C0J\nvQnw9fHYovGSAIQQZ8nMO05Wfjkzx/TFKOXhvbp28mPSsF4sziygus7zZq6RBCCEOMucdTl0CfDl\nR0meUfKxrVKTozlRXc+KnUesDsXpJAEIIU47WlnDl1uLmJFsp0tAi1OFeYWxA7oTFdqJzzxwTIAk\nACHEaZ9uyqO2wcG9Xtr1szk+PorpSVF8n32UwuOnrA7HqSQBCCEAqG9w8OH6XMbF9mBgzy5Wh9Oh\nzEiORmtY6GFF4yUBCCEAWLHzCEXl1V43709r9OkexOj+YczL8KwxAZIAhBCAMe9PVGgnxg/uaXUo\nHVJqcjS5pVVsPFhmdShOIwlACMGewxWsP1DGfWP6YvPx7q6fFzI5vhddAnw9qk6AJAAhBHPScgjw\n9eHOFM8u+dgWQf6+3Dw8kmXbijhZU291OE7RmqLw0Uqp75RSu5RSO5RSvzDbf6eUKlBKZZqPKU3e\n87RSKlsptUcpdVOT9klmW7ZS6qn22SQhxKUoP1XHws0FTEvsTbfO/laH06Glptipqm1g6bYiq0Nx\nitacAdQDv9JaDwFGA48opYaay/6qtU40H8sAzGV3AXHAJOB1pZRNKWUD/gFMBoYCdzf5HCGERRZk\n5HOqrkFu/rZCUp9u9A/v7DF1AlpMAFrrIq31ZvN5BbALiLrIW6YBc7XWNVrrg0A2RvH4kUC21vqA\n1roWmGuuK4SwiMOh+WB9Lsl9uzEsqqvV4XR4SilmJNvZlHOMg0dPWh1Om13SPQClVAwwAthgNj2q\nlNqqlHpXKd
XNbIsCmqbHfLPtQu3nfscspVS6Uiq9pMSzy7EJYbW12Uc5ePQk94+RgV+tNT3Jjo+C\n+RnufxbQ6gSglOoCLAB+qbU+AfwTGAAkAkXAK42rNvN2fZH2sxu0flNrnaK1TgkPD29teEKIyzBn\nXQ49ugQweZh3lXxsi4iQQK4dFM6CjAIaHO49JqBVCUAp5Yex8/9Ia70QQGt9RGvdoLV2AG9hXOIB\n48i+aVcCO1B4kXYhhAUOlVaxak8xPx7VB39f6RB4KVJTojl8oprvs49aHUqbtKYXkALeAXZprV9t\n0t70kOFHwHbz+RLgLqVUgFKqHxALbAQ2AbFKqX5KKX+MG8VLnLMZQohL9eGGXGxKcc+oPlaH4nZu\nGNKT0CA/t58grjXT/V0F3AdsU0plmm3PYPTiScS4jJMD/AxAa71DKfUZsBOjB9EjWusGAKXUo8By\nwAa8q7Xe4cRtEUK00qnaBj7dlMdNw3oRERJodThuJ8DXxm2JUXy84RDHq2oJDXLP7rMtJgCt9fc0\nf/1+2UXe8yLwYjPtyy72PiGEayzJKqD8VJ3Xl3xsi9QUO++vy2FJVqHbdqGVC39CeBmtNbPX5TK4\nVzBXxnRr+Q2iWXG9uzI0MsSty0VKAhDCy2TkHmNn0Qlmjo3x+pKPbZWaYmdbQTm7ik5YHcplkQQg\nhJeZnZZLSKAv0xJ7Wx2K25uWGIWfTbntWYAkACG8SPGJar7aVsQdKdEE+UvJx7YK6+zPhKERLMos\noLbeYXU4l0wSgBBe5OONh2jQWko+OlFqcjRlJ2tZtbvY6lAumSQAIbxEbb2DjzYc4rpB4cT06Gx1\nOB5jXGwPegYHuOUEcZIAhPASy3ccpqSihvvHxlgdikfxtflwe5Kd1XtLKK6otjqcSyIJQAgvMSct\nh77dg7g2VubYcrbUFDsNDs3nmwusDuWSSAIQwgvsKCxnU84x7hvdFx8p+eh0A8K7kNy3m9sVjZcE\nIIQX+CAtl05+NlKTpeRje0lNtpNdXElm3nGrQ2k1SQBCeLjjVbUsyizgthFRdA3yszocjzV1eCSB\nfj585kZjAiQBCOHh5qXnU13nkKIv7Sw40I8pwyL5MquQU7UNVofTKpIAhPBgDWbJx5H9whgSGWJ1\nOJdOa9jwBnzyY1j1Auz6Ao7lGu0dUGpKNBU19SzfcdjqUFpFhgIK4cH+vbeYQ2VVPDlpsNWhXLpT\nx2DRI7BnKXSNhr1fgzaPrANDITLh7EfYAPCx9ph2VL8wosM6MS8jj9tGXKx0escgCUAIDzZ7XS4R\nIQFMjIvQzZKnAAAd80lEQVSwOpRLU7gFPpsJJwph0p9g1M+gvhqO7ISiTDi8FYqyYMO/oKHWeI9/\nF+gVD72Gn0kK4VeAzXX3PXx8FDOSovnryr3klVURHRbksu++HJIAhPBQB4+e5N97S3h8wiD8bG5y\ntVdr2PQ2LH8GukTAf3wN9hRjmV8nsCcbj0YNdVCyG4rMhFCUBVs+hI1vGMttARAx9OwzhZ5x4Nd+\nRXCmJ0fxt2/3smBzPr+8cVC7fY8zSAIQwkPNScvBz6a4a6SbdP2sqYAvfgHbF0DsRPjRGxAUdvH3\n2PzMo/54GHGP0eZogLIDZkLINH7uWAQZ7xvLlQ3CB5sJwTxb6BUPAcFO2Qx7tyCuGtCD+Rn5/Hx8\nbIced9FiAlBKRQNzgF6AA3hTa/2aUioM+BSIwSgJeYfW+phZQ/g1YApQBTygtd5sftZM4Fnzo1/Q\nWs927uYIIQBO1tQzPz2fKfGR9Ax2g5KPR3bAZ/cbO+4bnoOrfnn51/N9bNAj1njEzzDatIbjh86c\nJRzeCtkrIetj800Kug84+/JRZELLCegCUlPs/GJuJusPljJ2QI/L2w4XaM0ZQD3wK631ZqVUMJCh\nlFoBPAB8q7V+SSn1FPAU8CQwGaMQfCwwCvgnMMpMGM8BKRh1hDOUUku01secvVFCeLvPtxRQUVPv\nHqUKt3wES38FgV1h5hcQc7Xzv0Mp6NbXeAy99Ux7xeEml48yIT8ddiw8s7xr9NkJoddwCO5lfN5F\n3BTXi+BAX+al57t3AtBaFwFF5vMKpdQuIAqYBlxnrjYbWI2RAKYBc7QxHnq9UipUKRVprrtCa10G\nYCaRScAnTtweIbye1po5aTkMiwohqU+o1eFcWG0VLHsCMj+EftfC9LehS0/XxhDcy3gMmnimrars\nzE3moiwjQexeinHcCnTuefblo8gECO17VlII9LNxS0JvFm7O5/fT4ggJ7JgD8C7pHoBSKgYYAWwA\nIszkgNa6SCnV+C8XBTSdFzXfbLtQuxDCidYfKGPvkUr+PGN4xy35eHSf0cuneCdc+6Tx8LFZHZUh\nKAz6X2c8GtVUwOHtZy4fFWXB/lVNuqV2bXL5KBEiE7gjqTcfbzjE0q1F3D2yj+u3oxVanQCUUl2A\nBcAvtdYnLvKH1dwCfZH2c79nFjALoE+fjvlLE6Ijm5OWQ2iQH7cmdNCSj9sXwJKfg28A3LsABt5g\ndUQtCwiGvmOMR6O6aiOBnT5TyIKNb0FDDQAJfkEs7dyH3O9iwW+SkSDCB4Ovv0Ubcb5WJQCllB/G\nzv8jrXXjBbIjSqlI8+g/Emgsh5MPNO12YAcKzfbrzmlffe53aa3fBN4ESElJ6ZjD/YTooAqPn+Kb\nnUf46bh+BPp1kCPqRvU1RvfOTW9D9GiY8S50deOLAH6BEJVkPBo11MHRvVCUhSrKImz3BvoeXwGL\nvzCW2/yh59Aml5ASISLO6OJqgdb0AlLAO8AurfWrTRYtAWYCL5k/Fzdpf1QpNRfjJnC5mSSWA/+j\nlOpmrjcReNo5myGEAPh4wyG01tw7qoPN+3Msx7jkU5QJY38ON/zWpQO0XMbmZ+zQI+Ig8cfYrq4m\n8Y8reSLFj5/FVpy5fLRrCWw2O0EqG/QYdM7N5ngIbP+pO1pzBnAVcB+wTSmVabY9g7Hj/0wp9SBw\nCEg1ly3D6AKajdEN9CcAWusypdQfgE3mes833hAWQrRdTX0Dn2w8xA1DIjrWCNTdS2HR/zOe3/UJ\nDJ5ibTwu1DM4kOuuiODtXeU8OO12fJt2Sy3PO3sA24HVsHXumTf3vw7uX9zMpzpPa3oBfU/z1+8B\nzrt4Z/b+eeQCn/Uu8O6lBCiEaJ1l24ooPVnLzI7S9bOhDlb+DtL+D3qPgNT3oVuMxUG5XmpKNCt3\nFbNmXwnjB5tTcigFoX2Mx5Cbz6xcccQ8S8gE3/a/LCQjgYXwELPX5dI/vDNXDexudShQXgDzfwJ5\nG2DkLJj4gnHT1wuNH9yT7p39mZeefyYBXEhwBARPgNgJLonNTSYIEUJcTFbecTLzjjNzTIz1XT+z\nv4U3xhmje2e8C1Ne9tqdP4CfzYfbRkSxctcRyk7WWh3OWSQBCOEB5qTl0tnfxu1JFvaqcTTAqhfh\nw+nQpRfM+jcMm25dPB1IaoqdugbNoi0dq2i8JAAh3FxpZQ1fbC1kerK
dYKtGnFYcgTnTYM2fjUnZ\nfroSegy0JpYOaHCvEIbbuzIvo2OVi5QEIISb+zQ9j9p6C0s+5nxvXPLJT4dpr8O0f4B/B+qF1EGk\nJtvZVXSC7QXlVodymiQAIdxYfYODj9Yf4qqB3RnY0znTGbeawwFrX4HZt0BACDy06syUzOI8tyZE\n4e/rw7z0vJZXdhFJAEK4sW93F1Nw/JTrZ/2sKoNP7oRvn4e4H8Gs74zCK+KCugb5MXFoBIuzCqmp\n7xhF4yUBCOHG5qTlEBXaiRsGu3AWzbxN8K9xxsClqa/A9HecVkzF092REs3xqjpW7ixueWUXkAQg\nhJvKLq7gh+xS7hndB19XlHzUGtJeh/cmGTN3PvgNXPnTFufGF2dcNbAHkV0DmZfRMS4DSQIQwk3N\nScvF39eHO1NcUPKxuhw+uw+WPw2DJsHP1hije8UlsfkopifZWbO3hMPl1VaHIwlACHdUUV3Hgox8\nbhnem+5d2nmQVVEWvHEt7PkKJr4Id34InTpwoZkObkayHYeGBZut7xIqCUAIN7RwcwEnaxuYObYd\nu35qDenvwtsToKEWHlgGYx+VSz5tFNOjMyP7hTE/Ix9j6jTrSAIQws1orZmdlkNidCjD7e10JF5T\nCQtnwZf/Cf3Gwc/WQp9R7fNdXig12c7BoyfJyLW2JLokACHczA/ZpRwoOdl+R//Fu+Gt8bB9Pox/\nFn48Dzp3gAnmPMiU+EiC/G3MS7f2MpAkACHczOy0HLp39mdKfKTzPzxrLrx1PZw6ZsxFf80T4CO7\nCWfrHODL1PhIvtxaSFVtvWVxyHTQwm3VNTjYcug49Q0Oq0NxmYqaer7ddYSHrxtIgK8TSz7WnYKv\nfgOb50Dfq2HGOxDcy3mfL86TmhLNvIx8lm07zIxkuyUxSAIQbsfh0HyxtZBXV+wlt7TK6nBczs+m\n+PGoPs77wNL9RrnGI9tg3K/huqfBJruG9nZlTDdiugcxLz1PEoAQLdFa8+2uYv7yzR52H65gSGQI\n//fjEYS3dzfIDqZHcAC9Q51ULWrHIlj8qFHL9p75LitEIkApRWpKNC8v30Nu6Un6du/s8hhaUxT+\nXeBmoFhrPcxs+x3wEFBirvaM1nqZuexp4EGgAfi51nq52T4JeA2wAW9rrV9y7qYIT7b+QCkvL99D\nRu4xYroH8fe7R3BzfCQ+PtIl8bLU18KK/4YN/wL7lUa5xq7WHIV6s9uTonjlmz3Mz8jnVxOvcPn3\nt+YM4H3g/4A557T/VWv9l6YNSqmhwF1AHNAbWKmUGmQu/gcwAcgHNimllmitd7YhduEFtuWX8/I3\ne1izt4SIkAD+50fxpKbY8XPF1Aee6vghmPcAFGTA6Efgxt+Br7/FQXmnyK6duDo2nAUZ+fzyxkHY\nXHxA05qi8GuUUjGt/LxpwFytdQ1wUCmVDYw0l2VrrQ8AKKXmmutKAhDNyi6u5NUVe1i27TChQX48\nM2Uw94+JIdDPiTc+vdGer+HznxmDvO78EIbcYnVEXu+OFDuPfryFdfuPMi423KXf3ZZ7AI8qpe4H\n0oFfaa2PAVHA+ibr5JttAHnntMuoEnGeguOneG3lXuZn5BPoZ+Pn4wfy02v6E2JVpStP0VAPq/4A\nP/wNeg2HO2ZDWH+roxLAjUMi6NrJj3np+W6TAP4J/AHQ5s9XgP8Amjt/0TQ/3qDZMdBKqVnALIA+\nfZzY00F0aEcra/jHd9l8tP4QAA+M7cfD1w+gh5fd4G0XJwph/oNwaB2k/Afc9EfwC7Q6KmEK9LMx\nLbE3czflUV5VR9cg1x3sXFYC0FofaXyulHoL+NJ8mQ80nZrQDhSazy/Ufu5nvwm8CZCSkmLtRBmi\n3Z2oruPtNQd45/uDnKprIDU5mp/fGEuUs3q5eLv938GCnxr9/G9/G4anWh2RaEZqcjRz0nJZsrWQ\n+0a7rrTnZSUApVSk1rrIfPkjYLv5fAnwsVLqVYybwLHARowzg1ilVD+gAONG8Y/bErhwb9V1DcxJ\ny+H11fs5XlXH1PhIHp84iAHhXawOzTM4GmDNy7D6JQgfDHfMgfBBLb9PWGJYVAiDewUzPz2vYyUA\npdQnwHVAD6VUPvAccJ1SKhHjMk4O8DMArfUOpdRnGDd364FHtNYN5uc8CizH6Ab6rtZ6h9O3RnR4\ndQ0OPkvP4+/f7uPIiRquGRTOExOvIN7e1erQPEdlCSx8CA58Bwl3G1W7/F3fx1y0XuOYgD98uZO9\nRyoYFOGaCmvK6ulILyYlJUWnp6dbHYZwgsbRu39dsZec0iqS+oTym0mDGd1fJhlzqtx1MP8/jLl8\npvwFRtwr0ze7idLKGkb9z7f85KoY/mtq2+orK6UytNYpLa0nI4FFu9Jas2p3MS8vN0bvDu4VzDsz\nUxg/uCdKdkzO43DAur8bRdq7xcA986BXvNVRiUvQvUsANwzpyedbCvjNpMEuGesiCUC0mw3m6N30\n3GP07R7Ea3clcsvw3jJ619mqymDRw7D3Kxh6G9z6vxAYYnVU4jKkJkezfMcRvttdzMS49p+MTxKA\ncLrtBeW8vHwP/95bQs/gAF64bRh3Xhkto3fbQ36GMaq3oggmvwwjH5JLPm7suivCCQ8OYF5GviQA\n4V72l1Ty6oq9LN1aRGiQH09PNkbvdvKX0btOpzVsfBOW/xcER8KDyyEq2eqoRBv52ny4fUQU73x/\nkJKKGsKD23ccjCQA0WaFx0/x2sp9zN+cT4CvD4+NH8hDMnq3/VSfgCWPwc5FMGgS3PZPCAqzOirh\nJKkpdt5Yc4BFWwp46Jr2Ha0tCUBcttLKGl5fvZ8P1ueChvvH9OWR6wfK6F1nq6mAYznGo+wgZLxv\nPJ/wPIx5TCp2eZiBPYNJjA5lfka+JADR8VRU1/HW2oO8s/YAp+oamJ5k5xc3xmLvFmR1aO7J4YDK\nw2d28Mdy4NjBM6+rjp69fmhfeOBL6DvWgmCFK/zu1ji6dmr/M2hJAKLVqusa+CAtl9dXZ3Osqo7J\nw3rxq4mDGNjTNYNW3FpdNRzPbX4nfywH6qvPrKt8IMQOYTEweKrRrTOsn/GzWwx06ub6+IVLJUaH\nuuR7JAGIFtU1OJiXns/fv93H4RPVjIvtwRM3XcFwu2v+SN2C1lBVeuGj+Ipzpr7y62zs1LsPhIE3\nNtnJ94Ou0TI/v3AJSQDighwOzZfbivjrir0cPHqSEX1C+eudiYwZ4KWjdxvqoDyvmR18jvGztuLs\n9YMjjR17/+vOOYrvB517SHdNYTlJAOI8WmtW7ynh5eV72Fl0gisignnr/hRuHOIFo3eryy98FF+e\nD8bUVgZbAHTra+zU+449ewcf2gf85Z6I6NgkAYizbMop489f72ZTzjH6hAXxtzsTuSWht8tL1bUb\nh8O4HNPcDv5YDpwqO3v9oO7GTt1+JcSnnr2TD46UHjjCrUkCEIAxevcv3+xh9R5j9O4fbhvGnSnR\n+Pu64Q6utsq44Vp28Pwd/P
FcaKg9s66yQWi0sUMfOu3sHXy3vhAos5QKzyUJwMsdMEfvfrm1iK6d\n/Hhq8mBmusvo3eN5cCgNyg6cvZOvPHz2ev7BRo+ankPgisln7+S7RoNN/hsI7yR/+V6qqNwYvTsv\nIx9/mw+PXm+M3nVF3+PLVlMJuT9A9rewfxWU7jMXKAjpbezQB95o7Oy7mT1qusUYo2Q9/d6FEJdB\nEoCXKTtZy+vfZTNnfS5aa+4bbYzebe85Ry6LwwFHtp3Z4R9aD4468O0EMVcZ9W37XWN0pZQat0Jc\nMkkAXqKypp631x7g7bUHqaqt50cj7PzyxliiwzpYT5WKw0Yd2/2rjIpWJ0uM9ohhMPr/wYDx0GeM\n7PCFcAJJAB6uuq6BD9fn8vrq/ZSdrGVSnDF6N9ZFJedaVFdtXMff/62x4z9ilpcO6mHs7AeMhwHX\nQ3D7T40rhLdpTU3gd4GbgWKt9TCzLQz4FIjBqAl8h9b6mDI6ib8GTAGqgAe01pvN98wEnjU/9gWt\n9Wznbopoqr7BwfyMfF77dh9F5cbo3V9PvIIEFw0xvyCtoWSPucNfBTk/QP0p8PGDPqPhxt8ZO/2I\neOliKUQ7a80ZwPvA/wFzmrQ9BXyrtX5JKfWU+fpJYDIQaz5GAf8ERpkJ4zkgBaOQfIZSaonW+piz\nNkQYHA7Nsu1FvPrNXg4cPUlidCiv3JHA2AE9rAuqqsy4nLN/lXGUf6LAaO8eC8kzjR1+36sgoIt1\nMQrhhVpMAFrrNUqpmHOapwHXmc9nA6sxEsA0YI42Ks2vV0qFKqUizXVXaK3LAJRSK4BJwCdt3gIB\nGDv+f+8t4S/f7GFH4QkGRXThzfuSmTA0wvWjdxvqIH+TscPP/hYKtwDa6FPf/zoY8KRxWSe0j2vj\nEkKc5XLvAURorYsAtNZFSqmeZnsUkNdkvXyz7ULtog201uwqqmBxVgFfZBZSWF5NdFgnXr0jgWmJ\nUa4dvVt2wOyt8x0cXGPMi6NsYE+B6542jvKjksDHDcYXCOElnH0TuLk9jr5I+/kfoNQsYBZAnz5y\nhNicQ6VVLMkqYHFmIfuKK/H1UVwzKJzfTBrMlPhI14zerT5h7Oj3rzKu5x/LMdpD+0D8DGOH3+8a\n6CQzhgrRUV1uAjiilIo0j/4jgWKzPR+IbrKeHSg02687p311cx+stX4TeBMgJSWl2SThjYorqlm6\ntYjFmYVk5h0HYGRMGC/cNowp8ZGEdW7n6YMdDVCYeWaHn7fRmBjNvwvEjIMxjxo7/bD+MuhKCDdx\nuQlgCTATeMn8ubhJ+6NKqbkYN4HLzSSxHPgfpVRjJYuJwNOXH7Z3OFFdx/Lth1mSVcgP2UdxaBga\nGcJTkwdzS0JvokI7tW8A5fnmDn8VHFgNp44BCiIT4OpfGjt8+0iZu14IN9WabqCfYBy991BK5WP0\n5nkJ+Ewp9SBwCEg1V1+G0QU0G6Mb6E8AtNZlSqk/AJvM9Z5vvCEszlZd18B3u4tZnFnIqj3F1NY7\n6BMWxCPXD+TWhN7t23+/tsqYaqHx5u3RPUZ7cCRcMcXY4fe/zpjLXgjh9pTRYadjSklJ0enp6VaH\n0e7qGxykHShlcWYhy7cfpqKmnh5dArh5eCTTEnuTGB3aPj15tDYGXjXu8A+lGTNl+gYa89sPGA8D\nbjAmUZPLOkK4DaVUhtY6paX1ZCSwRbTWZOYdZ3FmIV9uLeJoZQ3BAb5MGtaLaYlRjO4fhq+tHW7m\nVhafmWph/yo4ad6+6TkURs4y++SPBb92vrwkhLCcJAAX23ekgsWZhSzJKuRQWRX+vj7cMLgn0xJ7\nc90VPQn0c3I3yfoaYxK1xpu3h7cZ7UHdof/1Z6ZbCIl07vcKITo8SQAuUHD8FF9kFbI4s5BdRSfw\nUXDVwB48Nn4gNw3rRUigE6dg1hqO7juzw8/5HuqqwMcXokfDDb81dvi9EmSqBSG8nCSAdlJ2spZl\n24pYklnIxhzjfveIPqH87pahTB3e23nTL9dVQ/EOKMqCgs1Gb51yc8xd2AAYca+xw4+5GgI6yARw\nQogOQRKAE52sqWfFziMszixg7b6j1Ds0A3t24dcTB3FrQhR9urdx6uXqE8YlnKIsOLzV+Fmy50yh\n8sBQ6DcOxj1u7PS7xbR5m4QQnksSQBvV1jtYs7eExVmFrNh5mOo6B727BvLguH5MS4hiSGTw5fXg\nqSyBw1nGTr7I3NkfO3hmeZdeEDkcBk+FXsON56F9pbeOEKLVJAFcBodDs+FgGUuyCli27TDlp+ro\nFuTHjGQ70xKjSO7TDZ/WzsOjtXHJpnEnf3ir8byi8Mw63WKMnfyIeyAy0XgeHNEu2yaE8B6SAFpJ\na82OwhMszizgi6wiDp+oJsjfxsShEUxLjOLq2B74tdRt09EApfvNnXymsaM/vNUcYQsoH+gxyLiM\n02u4MeK2V7zMpyOEaBeSAFpw8OhJlmQWsjirgAMlJ/GzKa4d1JP/mjqEG4b0JMj/Ar/C+loo2XX2\nkf3h7VB30lhu8zf63g+51bh80ysBIuLAv4OVaBRCeCxJAM04cqKaL7IK+SKrkKz8cpSCUf3CeGhc\nfyYP60Vo0Dlz39SeNHbujTdmi7KgeJdRwByMCdN6xRs9ciITjB1++GCwObH7pxBCXCJJAKbyqjq+\n3mHMtpl2oBStIT6qK89OHcLNw3vTq6tZhLyqDA5sPXP5pijL6HffOLt1pzBjJz/mYfMSToIxQ6b0\nuRdCdDBenQCq6xr4dlcxizMLWL2nhNoGB/16dObn42O5NSGSAYGVxg4+c/6Z3jjlh858QEiUsZOP\nu/3MkX1IlPTEEUK4Ba9LAPUNDr7PPsqSzEKW7zjMydoGenbx5+cjbNwSUUafmnTU4a0wOwtOlpx5\nY9gAsCfDlf9x5gatzIophHBjXpEAtNZsPnSMxZmFfJ2VT+ipHFIC8ngj/AjxtkOElO9CbT8B2zGm\nTAgfDAMnnDmqjxgGgSFWb4YQQjiVRyeAvfklrF+/lsN7NhJ1ai/TbTk865OHf0CtsUJ5oLFzj59h\nXq8fbvTM8Qu0NnAhhHABj0wAhXn7qXnvdvo3HGKQcgBQGxiMT+8EfHtPMo7qIxOgeyzYPPJXIIQQ\nLfLIvV/PCDs7AiLYFXkjMXFjCO6XjH+3GLk5K4QQTXhkAvD1DyDhyW+sDkMIITq0NnVOV0rlKKW2\nKaUylVLpZluYUmqFUmqf+bOb2a6UUn9XSmUrpbYqpZKcsQFCCCEujzNGJ12vtU5sUn/yKeBbrXUs\n8K35GmAyEGs+ZgH/dMJ3CyGEuEztMTx1GjDbfD4buK1J+xxtWA+EKqWkDqEQQlikrQlAA98opTKU\nUrPMtgitdRGA+bOn2R4F5DV5b77Zdhal1CylVLpSKr2kpOTcxUIIIZykrTeBr9JaFyqlegIrlFK7\nL7Juc11w9HkNWr8JvAmQkpJy3nIhhBDO0aYzAK11ofmzGPgcGAkcaby0Y/4sNlfPB6K
bvN0ONKl6\nIoQQwpUuOwEopTorpYIbnwMTMSZTWALMNFebCSw2ny8B7jd7A40GyhsvFQkhhHC9tlwCigA+N+vd\n+gIfa62/VkptAj5TSj0IHAJSzfWXAVOAbKAK+EkbvlsIIUQbKa077mV2pVQJkNuGj+gBHHVSOO7C\n27bZ27YXZJu9RVu2ua/WOryllTp0AmgrpVR6k/EJXsHbttnbthdkm72FK7ZZylQJIYSXkgQghBBe\nytMTwJtWB2ABb9tmb9tekG32Fu2+zR59D0AIIcSFefoZgBBCiAvwyASglJqklNpjTj39VMvvcG9K\nqXeVUsVKqe1Wx+IqSqlopdR3SqldSqkdSqlfWB1Te1NKBSqlNiqlssxt/r3VMbmCUsqmlNqilPrS\n6lhcpbmp9tvlezztEpBSygbsBSZgTD+xCbhba73T0sDakVLqGqASY7bVYVbH4wrmNCORWuvN5oj0\nDOA2D/93VkBnrXWlUsoP+B74hTm7rsdSSj0OpAAhWuubrY7HFZRSOUCK1rpdxz544hnASCBba31A\na10LzMWYitpjaa3XAGVWx+FKWusirfVm83kFsItmZpf1JOZU6pXmSz/z4VlHcOdQStmBqcDbVsfi\niTwxAbRq2mnhOZRSMcAIYIO1kbQ/83JIJsYkiyu01p6+zX8DfgM4rA7ExZqbat/pPDEBtGraaeEZ\nlFJdgAXAL7XWJ6yOp71prRu01okYs+mOVEp57CU/pdTNQLHWOsPqWCxwldY6CaOS4iPmZV6n88QE\nINNOewnzOvgC4COt9UKr43ElrfVxYDUwyeJQ2tNVwK3m9fC5wHil1IfWhuQaF5hq3+k8MQFsAmKV\nUv2UUv7AXRhTUQsPYt4QfQfYpbV+1ep4XEEpFa6UCjWfdwJuBC5WhMmtaa2f1lrbtdYxGP+PV2mt\n77U4rHZ3kan2nc7jEoDWuh54FFiOcWPwM631Dmujal9KqU+ANOAKpVS+ORW3p7sKuA/jqDDTfEyx\nOqh2Fgl8p5TainGgs0Jr7TVdI71IBPC9UioL2Ags1Vp/3R5f5HHdQIUQQrSOx50BCCGEaB1JAEII\n4aUkAQghhJeSBCCEEF5KEoAQQngpSQBCCOGlJAEIIYSXkgQghBBe6v8Hx0u6aIFv9tIAAAAASUVO\nRK5CYII=\n", - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - " 6%|▌ | 584/10000 [02:24<38:50, 4.04it/s]" - ] - } - ], - "source": [ - "for i in trange(5000):\n", - " sess.run(train_step, sample_batch())\n", + "outputs": [], + "source": [ + "for i in trange(15000):\n", + " memory = list(pool.prev_memory_states)\n", + " rollout_obs, rollout_actions, rollout_rewards, rollout_mask = pool.interact(10)\n", + " train_on_rollout(rollout_obs, rollout_actions,\n", + " rollout_rewards, rollout_mask, memory)\n", "\n", " if i % 100 == 0:\n", " rewards_history.append(np.mean(evaluate(agent, env, n_games=1)))\n", " clear_output(True)\n", - " plt.plot(rewards_history, label='rewards')\n", + " plt.plot(rewards_history, label='rewards', linewidth=3)\n", " plt.plot(moving_average(np.array(rewards_history),\n", - " span=10), label='rewards ewma@10')\n", - " plt.legend()\n", + " span=10), label='rewards ewma@10', linewidth=3)\n", + " plt.legend(fontsize=13)\n", + " plt.grid()\n", " plt.show()\n", " if rewards_history[-1] >= 10000:\n", " print(\"Your agent has just passed the minimum homework threshold\")\n", " break" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Relax and grab some refreshments while your agent is locked in an infinite loop of violence and death.\n", + "\n", + "__How to interpret plots:__\n", + "\n", + "The session reward is the easy thing: it should in general go up over time, but it's okay if it fluctuates ~~like crazy~~. It's also OK if it reward doesn't increase substantially before some 10k initial steps. However, if reward reaches zero and doesn't seem to get up over 2-3 evaluations, there's something wrong happening.\n", + "\n", + "\n", + "Since we use a policy-based method, we also keep track of __policy entropy__ - the same one you used as a regularizer. The only important thing about it is that your entropy shouldn't drop too low (`< 0.1`) before your agent gets the yellow belt. Or at least it can drop there, but _it shouldn't stay there for long_.\n", + "\n", + "If it does, the culprit is likely:\n", + "* Some bug in entropy computation. Remember that it is $ - \\sum p(a_i) \\cdot log p(a_i) $\n", + "* Your agent architecture converges too fast. Increase entropy coefficient in actor loss. 
\n", + "* Gradient explosion - just clip gradients and maybe use a smaller network\n", + "* Us. Or TensorFlow developers. Or aliens. Or lizardfolk. Contact us on forums before it's too late!\n", + "\n", + "If you're debugging, just run `logits, values = agent.step(batch_states)` and manually look into logits and values. This will reveal the problem 9 times out of 10: you'll likely see some NaNs or insanely large numbers or zeros. Try to catch the moment when this happens for the first time and investigate from there." + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -558,166 +665,31 @@ "\"\"\".format(video_names[-1])) # You can also try other indices" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### POMDP setting\n", - "\n", - "The atari game we're working with is actually a POMDP: your agent needs to know timing at which enemies spawn and move, but cannot do so unless it has some memory. \n", - "\n", - "Let's design another agent that has a recurrent neural net memory to solve this.\n", - "\n", - "__Note:__ it's also a good idea to scale rollout_len up to learn longer sequences. You may wish set it to >=20 or to start at 10 and then scale up as time passes." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "class SimpleRecurrentAgent:\n", - " def __init__(self, name, obs_shape, n_actions, reuse=False):\n", - " \"\"\"A simple actor-critic agent\"\"\"\n", - "\n", - " with tf.variable_scope(name, reuse=reuse):\n", - " # Note: number of units/filters is arbitrary, you can and should change it at your will\n", - " self.conv0 = Conv2D(32, (3, 3), strides=(2, 2), activation='elu')\n", - " self.conv1 = Conv2D(32, (3, 3), strides=(2, 2), activation='elu')\n", - " self.conv2 = Conv2D(32, (3, 3), strides=(2, 2), activation='elu')\n", - " self.flatten = Flatten()\n", - " self.hid = Dense(128, activation='elu')\n", - "\n", - " self.rnn0 = tf.nn.rnn_cell.GRUCell(256, activation=tf.tanh)\n", - "\n", - " self.logits = Dense(n_actions)\n", - " self.state_value = Dense(1)\n", - "\n", - " # prepare a graph for agent step\n", - " _initial_state = self.get_initial_state(1)\n", - " self.prev_state_placeholders = [tf.placeholder(m.dtype,\n", - " [None] + [m.shape[i] for i in range(1, m.ndim)])\n", - " for m in _initial_state]\n", - " self.obs_t = tf.placeholder('float32', [None, ] + list(obs_shape))\n", - " self.next_state, self.agent_outputs = self.symbolic_step(\n", - " self.prev_state_placeholders, self.obs_t)\n", - "\n", - " def symbolic_step(self, prev_state, obs_t):\n", - " \"\"\"Takes agent's previous step and observation, returns next state and whatever it needs to learn (tf tensors)\"\"\"\n", - "\n", - " nn = self.conv0(obs_t)\n", - " nn = self.conv1(nn)\n", - " nn = self.conv2(nn)\n", - " nn = self.flatten(nn)\n", - " nn = self.hid(nn)\n", - "\n", - " (prev_rnn0,) = prev_state\n", - "\n", - " # Apply recurrent neural net for one step here.\n", - " # See docs on self.rnn0(...).\n", - " # The recurrent cell should take the last feedforward dense layer as input.\n", - " \n", - "\n", - " logits = self.logits( )\n", - " state_value = self.state_value( )\n", - "\n", - " new_state = [new_rnn0]\n", - "\n", - " return new_state, (logits, state_value)\n", - "\n", - " def get_initial_state(self, batch_size):\n", - " \"\"\"Return a list of agent memory states at game start. 
Each state is a np array of shape [batch_size, ...]\"\"\"\n", - " # feedforward agent has no state\n", - " return [np.zeros([batch_size, self.rnn0.output_size], 'float32')]\n", - "\n", - " def step(self, prev_state, obs_t):\n", - " \"\"\"Same as symbolic state except it operates on numpy arrays\"\"\"\n", - " sess = tf.get_default_session()\n", - " feed_dict = {self.obs_t: obs_t}\n", - " for state_ph, state_value in zip(self.prev_state_placeholders, prev_state):\n", - " feed_dict[state_ph] = state_value\n", - " return sess.run([self.next_state, self.agent_outputs], feed_dict)\n", - "\n", - " def sample_actions(self, agent_outputs):\n", - " \"\"\"pick actions given numeric agent outputs (np arrays)\"\"\"\n", - " logits, state_values = agent_outputs\n", - " policy = np.exp(logits) / np.sum(np.exp(logits),\n", - " axis=-1, keepdims=True)\n", - " return [np.random.choice(len(p), p=p) for p in policy]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "agent = SimpleRecurrentAgent('agent_with_memory', obs_shape, n_actions)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Now let's train it!" - ] - }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], - "source": [ - "# A whole lot of your code here: train the new agent with GRU memory.\n", - "# - create pool\n", - "# - write loss functions and training op\n", - "# - train\n", - "# You can reuse most of the code with zero to few changes" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "```\n", - "\n", - "```\n", - "```\n", - "\n", - "```\n", - "```\n", - "\n", - "```\n", - "```\n", - "\n", - "```\n", - "```\n", - "\n", - "```\n", - "```\n", - "\n", - "```\n", - "```\n", - "\n", - "```\n", - "```\n", - "\n", - "```\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Homework assignment is in the second notebook: [url]" - ] + "source": [] } ], "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", "name": "python", - "pygments_lexer": "ipython3" + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.3" } }, "nbformat": 4,
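A minimal NumPy sketch of the debugging advice in the troubleshooting notes above: inspect `logits` and `values` (e.g. from `agent.step(batch_states)`) and keep an eye on the policy entropy $- \sum_a p(a_i) \cdot \log p(a_i)$. The names `agent` and `batch_states` and the `< 0.1` entropy threshold come from the notebook's own text; the magnitude thresholds in the check are arbitrary illustrative numbers, and the dummy arrays exist only so the snippet runs on its own.

```python
import numpy as np

def policy_entropy(logits):
    """Per-sample entropy -sum_a p(a) * log p(a) of a categorical policy,
    computed from raw logits with the log-sum-exp trick to avoid overflow."""
    logits = np.asarray(logits, dtype=np.float64)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -(np.exp(log_probs) * log_probs).sum(axis=-1)

def sanity_check(logits, values):
    """Print the warning signs the notebook mentions: NaNs, huge magnitudes,
    and entropy collapsing below ~0.1."""
    logits, values = np.asarray(logits), np.asarray(values)
    if np.isnan(logits).any() or np.isnan(values).any():
        print("NaNs in logits or values - inspect the losses and gradients")
    # The thresholds below are illustrative only, not tuned values.
    if np.abs(logits).max() > 1e2 or np.abs(values).max() > 1e5:
        print("suspiciously large magnitudes - possible gradient explosion")
    entropy = policy_entropy(logits).mean()
    print("mean policy entropy: %.3f" % entropy)
    if entropy < 0.1:
        print("entropy is very low - consider raising the entropy coefficient")

# Dummy data so the snippet is self-contained; in the notebook you would pass
# the outputs of `agent.step(batch_states)` instead.
sanity_check(np.random.randn(4, 6), np.random.randn(4, 1))
```

Running such a check every few hundred training iterations is usually enough to catch the "NaNs or insanely large numbers" failure mode the text describes before the reward curve flatlines.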