Deriving the HJB equation

Russ Tedrake mentions the Hamilton-Jacobi-Bellman equation in his course on Underactuated Robotics and refers the reader to Dynamic Programming and Optimal Control by Dimitri Bertsekas for a nice intuitive derivation that starts from a discrete version of Bellman’s optimality principle and obtains the HJB equation in the limit. Here, instead, the HJB equation is derived directly from the additivity of the long-term cost function, since Bellman’s optimality principle is itself a consequence of the additive cost structure.

Preliminaries

Consider a trajectory $x$ developing according to a differential equation

$$ \dot{x}(t) = f(x(t), u(t)) $$
driven by a controller $u$. The performance of a controller $u$ on the trajectory starting from an initial state $x(0)=x_{0}$ is measured by a cumulative cost

$$ J(x_0, u) = \int_0^{\infty} g(x(t), u(t)) \, dt, $$
defined as an integral of an immediate cost $g(x, u)$. A controller $u$ that minimizes the cumulative cost $J$ is called an optimal controller. It is denoted by

$$ u^{*} = \arg \min_{u} J(x_0, u). $$
Note that the optimal controller $u^{*}$ is parameterized by $x_0$. The corresponding optimal cumulative cost

$$ J^{*}(x_0) = \min_{u} J(x_0, u) = J(x_0, u^{*}), $$
viewed as a function of the initial state $x_0$ is known as the value function.
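
For concreteness, a standard scalar example that will be used purely as an illustration: take the dynamics $\dot{x} = u$ with immediate cost $g(x, u) = x^{2} + u^{2}$, so that

$$ J(x_0, u) = \int_0^{\infty} \left( x(t)^{2} + u(t)^{2} \right) dt. $$

We will return to this example once the HJB equation has been derived.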

Derivation

Remarkably, $J^{*}$, considered as a function of $x_0$, turns out to satisfy a certain differential equation. Our goal in this section is to derive that equation.

For any admissible $u$, the integral in the definition of the cumulative cost can be split as follows

$$ J(x_0, u) = \int_0^{\Delta t} g(x(t), u(t)) \, dt + \int_{\Delta t}^{\infty} g(x(t), u(t)) \, dt \approx g(x_0, u_0) \, \Delta t + J(x(\Delta t), u), $$
where $u_0 = u(0)$ and the second integral has been identified with the cumulative cost of (a suitably time-shifted) $u$ starting from $x(\Delta t)$. Approximating the next state from the differential equation as

$$ x(\Delta t) \approx x_0 + f(x_0, u_0) \, \Delta t, $$
we are able to expand $J$ in a Taylor series,

$$ J(x(\Delta t), u) \approx J(x_0, u) + J_x(x_0, u) \cdot f(x_0, u_0) \, \Delta t, $$

where $J_x$ denotes the derivative of $J$ with respect to its state argument,
provided that $J$ is sufficiently smooth. Substituting it into the expression above, we obtain

$$ J(x_0, u) \approx g(x_0, u_0) \, \Delta t + J(x_0, u) + J_x(x_0, u) \cdot f(x_0, u_0) \, \Delta t, $$
which, after cancelling $J(x_0, u)$ on both sides, dividing by $\Delta t$, and taking the limit $\Delta t \to 0$, gives

$$ g(x_0, u_0) + J_x(x_0, u) \cdot f(x_0, u_0) = 0. $$
This equality holds for any admissible controller $u$. In particular, for the optimal controller $u^*$, we obtain

$$ g(x_0, u_0^{*}) + J^{*}_x(x_0) \cdot f(x_0, u_0^{*}) = 0. $$
This equality holds for any initial state $x_0$, where $u_0^{*} = u^{*}(0, x_0)$ is the optimal initial action. Moreover, applying any other initial action $u_0$ for a short time and then acting optimally can only increase the cost above $J^{*}(x_0)$, which in the same limit gives $g(x_0, u_0) + J^*_x(x_0) \cdot f(x_0, u_0) \ge 0$. Together, the two statements can be written as the Hamilton-Jacobi-Bellman equation

$$ \min_{u_0} \left[ g(x_0, u_0) + J^*_x(x_0) \cdot f(x_0, u_0) \right] = 0, \quad\forall x_0. $$

Note that the minimization here is performed over a single action $u_0$, in contrast to the problem formulation we started with, where the minimization is over functions $u$. As a consequence, the optimal controller can be written as a time-invariant state feedback: the optimal action at time $t$ depends only on the current state, $u^{*}(t, x_0) = u^{*}(x(t))$, satisfying

$$ u^*(x_0) = \arg \min_{u_0} \left[ g(x_0, u_0) + J^*_x(x_0) \cdot f(x_0, u_0) \right], \quad\forall x_0. $$
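
As a sanity check, the HJB equation can be solved in closed form for the scalar example introduced in the preliminaries. There it reads

$$ \min_{u_0} \left[ x_0^{2} + u_0^{2} + J^*_x(x_0) \, u_0 \right] = 0. $$

The minimizer is $u_0 = -\tfrac{1}{2} J^*_x(x_0)$, and substituting it back yields $x_0^{2} - \tfrac{1}{4} J^*_x(x_0)^{2} = 0$, i.e. $J^*_x(x_0) = 2 x_0$ (taking the sign for which $J^{*}$ is a nonnegative cost), so that

$$ J^{*}(x_0) = x_0^{2}, \qquad u^{*}(x_0) = -x_0. $$

Indeed, under $u^{*} = -x$ the trajectory is $x(t) = x_0 e^{-t}$, and the cumulative cost evaluates to $\int_0^{\infty} 2 x_0^{2} e^{-2t} \, dt = x_0^{2}$.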

For further discussion of dynamic programming, refer to the sources mentioned above.
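
Finally, a minimal numerical sketch of the scalar example above (the forward-Euler discretization, step size, and horizon are chosen purely for illustration): it simulates the linear feedback $u = -k x$ for several gains $k$, accumulates the cost $x^2 + u^2$, and compares the result with the exact closed-loop cost $\frac{1 + k^2}{2k} x_0^2$, which is smallest at $k = 1$, where it equals the value $J^*(x_0) = x_0^2$ predicted by the HJB solution.

```python
def cumulative_cost(x0, k, dt=1e-3, horizon=20.0):
    """Simulate xdot = u with the linear feedback u = -k*x (forward Euler)
    and accumulate the immediate cost g(x, u) = x^2 + u^2."""
    x, cost = x0, 0.0
    for _ in range(int(horizon / dt)):
        u = -k * x                    # feedback law under test
        cost += (x * x + u * u) * dt  # accumulate g(x, u) dt
        x += u * dt                   # Euler step of xdot = f(x, u) = u
    return cost

x0 = 1.5
for k in (0.25, 0.5, 1.0, 2.0, 4.0):
    simulated = cumulative_cost(x0, k)
    exact = (1 + k * k) / (2 * k) * x0 * x0  # exact cost of u = -k*x
    print(f"k = {k:4.2f}: simulated J = {simulated:.4f}, exact J = {exact:.4f}")

print(f"HJB prediction: J*(x0) = x0^2 = {x0 * x0:.4f}, attained at k = 1")
```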