forked from Pdbz199/Koopman-RL-Old
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathKoopman RL 5-5.py
248 lines (219 loc) · 8.35 KB
/
Koopman RL 5-5.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
'''======= OUTLINE ======='''
# LQR in Python
# Reformulation of LQR in framework
# VI for LQR
# Current issues with our implementation
# Ways forward
# Model Predictive Path Integral (MPPI) locally with Koopman dynamics
# Local LQR with
'''===== END OUTLINE ====='''
# IMPORTS
import observables
import numpy as np
import matplotlib.pyplot as plt
from scipy import integrate
from control import lqr
# Slow Manifold Example (KOOC/KRONIC)
## Model setup
'''
In this section, we study a deterministic dynamical system where we can explicitly find a finite dimensional linear embedding for the Koopman operator. This example comes from Steve's KRONIC paper. In particular, we consider a controlled system with quadratic nonlinearity that gives rise to a slow manifold:
\begin{align}
\frac{d}{dt}
\begin{bmatrix}
x_1\\
x_2
\end{bmatrix} =
\begin{bmatrix}
\mu x_1
\\
\lambda (x_2 - x_1^2)
\end{bmatrix}
+ \mathbf{D} \ u
\end{align}
Considering a transformation to a Koopman invariant subspace, $(y_1,y_2,y_3) = (x_1,x_2,x_1^2)$, we can express the system dynamics as
\begin{align}
\frac{d}{dt}
\begin{bmatrix}
y_1\\
y_2\\
y_3
\end{bmatrix} =
\begin{bmatrix}
\mu & 0 & 0
\\
0 & \lambda & -\lambda
\\
0 & 0 & 2\mu
\end{bmatrix}
\begin{bmatrix}
y_1\\
y_2\\
y_3
\end{bmatrix}
+
\underbrace{\begin{bmatrix}
1 & 0\\
0 & 1\\
2y_1 & 0
\end{bmatrix}
\mathbf{D}}_{D_y} \ u
\end{align}
'''
## Definitions
mu = -0.1
lamb = 1
tspan = np.arange(0, 50, 0.01)
x0 = np.array([
[-5],
[5]
])
A = np.array([
[mu, 0 ],
[0, lamb]
])
B = np.array([
[0],
[1]
])
Q = np.identity(2)
R = 1
A2 = np.array(
[[mu, 0, 0 ],
[0, lamb, -lamb],
[0, 0, 2*mu ]]
)
B2 = np.array(
[[0],
[1],
[0]]
)
Q2 = np.array(
[[1, 0, 0],
[0, 1, 0],
[0, 0, 0]]
)
## Calculate LQR controller
# C is supposed to be [0,2.4142] and
# C2 is supposed to be [0,2.4142,-1.4956]
C = lqr(A, B, Q, R)[0][0]
vf = lambda tau, x: A @ x + np.array([0, -lamb * x[0]**2]) + B.T * (-C @ x)
xLQR = integrate.solve_ivp(vf, (0,50), x0[:,0], first_step=0.01, max_step=0.01)
xLQR = xLQR.y[:,:-2]
C2 = lqr(A2, B2, Q2, R)[0][0]
vf2 = lambda tau, x: A @ x + np.array([0, -lamb * x[0]**2]) + B.T * ((-C2[0:2] @ x) - (C2[2] * x[0]**2))
xKOOC = integrate.solve_ivp(vf2, (0,50), x0[:,0], first_step=0.01, max_step=0.01)
xKOOC = xKOOC.y[:,:-2]
JLQR = np.cumsum(xLQR[0]**2 + xLQR[1]**2 + (C @ np.conjugate(xLQR))**2)
JKOOC = np.cumsum(xKOOC[0]**2 + xKOOC[1]**2 + (C @ np.conjugate(xKOOC))**2)
## Plots
plt.style.use('dark_background')
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(18, 6))
ax1.plot(xLQR[0], xLQR[1], 'w-')
ax1.plot(xKOOC[0], xKOOC[1], 'r--')
ax1.grid(b=True, which='major', color='#666666', linestyle='-')
ax1.set_xlabel('x_1')
ax1.set_ylabel('x_2')
ax2.set_xlim(0, 50)
ax2.plot(tspan, xLQR.T, 'w-')
ax2.plot(tspan, xKOOC.T, 'r--')
ax2.grid(b=True, which='major', color='#666666', linestyle='-')
ax2.set_xlabel('t')
ax2.set_ylabel('x_k')
ax3.set_xlim(0, 50)
ax3.plot(tspan, JLQR, 'w-')
ax3.plot(tspan, JKOOC, 'r--')
ax3.grid(b=True, which='major', color='#666666', linestyle='-')
ax3.set_xlabel('t')
ax3.set_ylabel('J')
plt.subplots_adjust(wspace=0.5)
plt.show()
## Koopman MDP Approach
'''
Let $\mathbf{F}\ y = x$,
\begin{align}
\mathbf{F} = \begin{bmatrix}
1 & 0 & 0\\
0 & 1 & 0
\end{bmatrix}
\implies
\frac{\text{d}x}{\text{d}t} = \mathbf{E}\ \frac{\text{d}y}{\text{d}t}
\end{align}
Note: $\exists$ linear transformation $y \to x$ but not $x \to y$\\\\
Now looking at the continuation term of the HJB equation:
\begin{align}
\mathcal{L} V = \dot{V} = \frac{\text{d}V}{\text{d}x}\ \frac{\text{d}x}{\text{d}t} = \nabla_x V\ \mathbf{F}\ \frac{\text{d}y}{\text{d}t}
\end{align}
In order to characterize $\dot{V}$, we must then derive an expression for $\frac{\text{d}V}{\text{d}x}$. Assuming that $V$ is a linear combination of the Koopman invariant subspace $y$:
\begin{align}
V = \mathbf{B}^\top\ y \implies \nabla_x V &= \nabla_x \mathbf{B}^\top\ y\\
&= \nabla_x (y^\top\ \mathbf{B})\\
&= J_y (x_1, x_2)\ \mathbf{B}\\
&= \begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_2}{\partial x_1} & \frac{\partial y_3}{\partial x_1}\\
\frac{\partial y_1}{\partial x_2} & \frac{\partial y_2}{\partial x_2} & \frac{\partial y_3}{\partial x_2}
\end{bmatrix}
\ \mathbf{B}
\end{align}
If it is correct to think statically (holding time fixed), then the Jacobian matrix is not hard to calculate and is given by
\begin{align*}
J_y (x_1, x_2) =
\begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_2}{\partial x_1} & \frac{\partial y_3}{\partial x_1}\\
\frac{\partial y_1}{\partial x_2} & \frac{\partial y_2}{\partial x_2} & \frac{\partial y_3}{\partial x_2}
\end{bmatrix}
&= \begin{bmatrix}
\mu & 0 & 2x_1\\
0 & 1 & 0
\end{bmatrix}
\end{align*}
Because LQR has no discounting and it is framed as a minimization problem, the HJB becomes
\begin{align}
0 &= \inf_{\pi_t \in \mathcal{P}(U)} \underbrace{
\int_U \left( r(x,u) + \lambda \ln{\pi_t(u)} + (\mathcal{L}v)(x,u)\right) \pi_t(u) \text{d}u}_{H\left(x,\pi(x),\mathcal{L}V\right)} \label{HJB_LQR}
\end{align}
where $r(x,u) = x^\top Q x + u^\top Ru$ and the function $H$ denotes the Hamiltonian to be minimized. The optimal policy is then
\begin{align}
\pi(x) = \argmin_{\pi} H\left(x,\pi(x),\mathcal{L}V\right)
\end{align}
the solution of which is
\begin{align}
\pi^*(u|x) &= \frac{\exp\left(-\frac{1}{\lambda}(r(x,u) + (\mathcal{L}v)(x,u))\right)}{\int_U \exp\left(-\frac{1}{\lambda}(r(x,u) + (\mathcal{L}v)(x,u))\right)du}\label{opt_pol_LQR}
\end{align}
We can use the DP relationship to update the value function as follows
\begin{align}
V_{j+1}(x(t)) = \mathbb{E}_{u\sim\pi}\left(\int_{t}^{t+\tau} r(x(s),u)ds\right) + V_j(x(t+\tau))
\end{align}
Expressing this in terms of the Koopman operator
\begin{align}
V_{j+1}(x(t)) = \mathbb{E}_{u\sim\pi}\left(\int_{t}^{t+\tau} r(x(s),u)ds\right) + \mathcal{K}_\tau V_j(x(t))
\end{align}
Using the relationship between the Koopman operator and its generator $\mathcal{K}_{\tau} = \exp(\tau\mathcal{L})$
\begin{align}
V_{j+1}(x(t)) &= \mathbb{E}_{u\sim\pi}\left(\int_{t}^{t+\tau} r(x(s),u)ds\right) + \exp(\tau\mathcal{L}) V_j(x(t))
\\
&= \int_U\left(\int_{t}^{t+\tau} r(x(s),u)ds\right)\pi(u)du + \exp(\tau\mathcal{L}) V_j(x(t))
\\
&= \int_{t}^{t+\tau}\left( \int_U r(x(s),u)\pi(u)du\right)ds + \exp(\tau\mathcal{L}) V_j(x(t))
\end{align}
Specializing the reward to the special case of LQR $r(x,u) = x^\top Q x + u^\top Ru$
\begin{align}
V_{j+1}(x(t))
&= \int_{t}^{t+\tau}x(t)^\top Q x(t)ds + \tau\int_U u^\top Ru \;\pi(u)du + \exp(\tau\mathcal{L}) V_j(x(t))
\end{align}
Furthermore, if we take the control to be univariate
\begin{align}
V_{j+1}(x(t))
&= \int_{t}^{t+\tau}x(t)^\top Q x(t)ds + \tau \mathbb{E}_{u\sim\pi}(Ru^2) + \exp(\tau\mathcal{L}) V_j(x(t)) \label{BellmanEq}
\end{align}
Next, we outline an algorithm for value function iteration in this case
\begin{enumerate}
\item Simulate states by taking (for now) the optimal policy of $u$ from the KOOC approach as given and using the dynamics of x. This gives us samples $\bf{x_1,...,x_m}$ and $u_1,...,u_{m-1}$. In addition, we also have observations $\bf{y_1,...,y_m}$.
\item Guess initial value function and evaluate it at sample points
\item Use least squares to project onto dictionary space (which is richer than the minimal function space that spans the dynamics). Denote this matrix by $\hat B^\top$.
\item Get the unforced Koopman operator of ${\bf y}=(x_1,x_2,x_1^2)$
\item Calculate $\hat{\mathcal{L}} V = \dot{V} = \frac{\text{d}V}{\text{d}x}\ \frac{\text{d}x}{\text{d}t} = \nabla_x \psi(x)^\top \hat B\ \mathbf{F}\ \frac{\text{d}y}{\text{d}t}$
\item Find opt policy $\pi^*$ using \eqref{opt_pol_LQR}
\item Calculate next value function by plugging in optimal policy and the current value function into the Bellman equation \eqref{BellmanEq}
\end{enumerate}
Unfortunately, dealing with the numerical issues with integration is proving to be difficult, especially when evaluating the optimal policy on each iteration and integrating wrt it.
'''