Linearly constrained relative entropy minimization
16 Jan 2018
In reinforcement learning,
one usually seeks to maximize a reward
while staying close to a current policy.
However, one could turn the problem around
and instead minimize the deviation from the current policy
under the constraint that the new policy achieves a higher reward.
Such a reformulation is remarkable for the beautiful maths
that accompanies it,
initially discovered by physicists
under the name of free energy minimization.
Some ideas presented here
are known in different communities under different names:
see, for example,
Fenchel duality, entropy, and the log partition function;
the Donsker-Varadhan formula;
and the Compression Lemma.
Exponential families in general are wonderfully described by
Wainwright and Jordan.
It is helpful to keep in mind
episode-based policy search
as an example application of the abstract framework presented below.
Let $\mu$ be a given probability measure on a space $\mathcal{X}$ (e.g., the current policy),
and let $\nu$ be a measure on $\mathcal{X}$ that we want to find (e.g., the new policy).
Assuming $\nu$ is absolutely continuous with respect to $\mu$,
denote by $p = \frac{d\nu}{d\mu}$ the Radon-Nikodym derivative of $\nu$ with respect to $\mu$.
The quality of a measure is evaluated using a reward function
$R \colon \mathcal{X} \to \mathbb{R}$,
and it is defined as the expectation of $R$ under the given measure.
All integrals in this note go over $\mathcal{X}$.
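Spelled out with the symbols just introduced (the letters $\mu$, $\nu$, $p$, and $R$ are a choice made for this reconstruction; the original notation may differ), the expected reward of $\nu$ is
$$
J(\nu) = \int R \, d\nu = \int R \, p \, d\mu .
$$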
The following optimization problem (sketched below) yields $\nu$ as its solution,
allowing one to find $\nu$ given $\mu$.
The function $f$ is assumed to be convex and to satisfy the property
$f(1) = 0$;
we additionally assume that the domain of $f$ is $[0, \infty)$
in order to avoid adding an explicit non-negativity constraint on $p$.
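A sketch of the problem, under the notation above and with $\varepsilon$ denoting the prescribed expected reward (the symbol $\varepsilon$ is likewise a choice made for this reconstruction):
$$
\min_{p} \; \int f(p) \, d\mu
\quad \text{subject to} \quad
\int R \, p \, d\mu = \varepsilon,
\qquad
\int p \, d\mu = 1 .
$$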
With such a choice of $f$,
the objective that we are minimizing
is the $f$-divergence $D_f(\nu \,\|\, \mu) = \int f(p) \, d\mu$.
The first equality constraint requires the expectation of $R$
under the new policy to be equal to $\varepsilon$;
choosing $\varepsilon$ greater than the current expected reward $\int R \, d\mu$ guarantees that $\nu$ achieves a higher
reward than $\mu$.
The last equality constraint ensures that $\nu$ is normalized.
Primal optimal point and dual objective
The Lagrangian of this problem
has an extremal whose form is sketched below,
where $f'_*$ denotes the derivative
of the Legendre transform $f_*$ of $f$;
we put the star at the bottom
to leave space at the top for the derivative symbol.
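Writing $\eta$ and $\lambda$ for the multipliers of the reward and normalization constraints (again, the letters are a choice made for this reconstruction), the Lagrangian and its stationarity condition read
$$
L(p, \eta, \lambda)
= \int f(p) \, d\mu
+ \eta \Big( \varepsilon - \int R \, p \, d\mu \Big)
+ \lambda \Big( 1 - \int p \, d\mu \Big),
\qquad
f'(p) = \eta R + \lambda
\;\Longrightarrow\;
p = f'_*(\eta R + \lambda).
$$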
Using Fenchel’s equality $f(p) + f_*(y) = p \, y$, which holds at $y = f'(p)$, we arrive at the Lagrangian dual.
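Substituting the extremal $p$ back into $L$ gives, as a sketch,
$$
g(\eta, \lambda)
= \eta \varepsilon + \lambda - \int f_*(\eta R + \lambda) \, d\mu ,
$$
to be maximized over $\eta$ and $\lambda$.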
Without any additional assumptions, we cannot eliminate $\eta$ or $\lambda$
from the dual, and thus have to rely on numeric optimization for finding them.
With the special choice
$f(x) = x \log x$,
the $f$-divergence turns into the KL divergence
(also called relative entropy).
The exponential form of the Legendre transform,
$f_*(y) = e^{y - 1}$,
renders the extremal density
an exponential family,
allowing us to satisfy the normalization constraint analytically
by setting
$\lambda = 1 - A(\eta)$,
where $A$ is known as the log-partition function.
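To sketch this step: with $f_*(y) = e^{y-1}$, the extremal density is $p = e^{\eta R + \lambda - 1}$, and the normalization constraint $\int p \, d\mu = 1$ yields
$$
\lambda = 1 - \log \int e^{\eta R} \, d\mu = 1 - A(\eta),
\qquad
A(\eta) := \log \int e^{\eta R} \, d\mu .
$$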
(The log-partition function
is closely related to the cumulant generating function,
with the difference that in the cumulant generating function
the expectation is computed with respect to the distribution induced by $R$,
while in the log-partition function it is computed with respect to
an arbitrary fixed base measure $\mu$.)
The density can then be expressed in terms of $\eta$ only,
which turns the dual into a function of $\eta$ alone, as sketched below.
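Concretely, under the notation above,
$$
p = e^{\eta R - A(\eta)},
\qquad
g(\eta) = \eta \varepsilon - A(\eta) .
$$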
There are a few interesting observations one can make about this dual.
Computing the maximum of $g(\eta)$
for a given $\varepsilon$ is equivalent to evaluating
the Legendre transform of $A$
at the given $\varepsilon$; in other words,
the optimal dual value is $A_*(\varepsilon)$.
By strong duality, the Legendre transform of the log-partition function
equals the KL divergence from $\nu$ to $\mu$, as sketched below.
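As a sketch:
$$
A_*(\varepsilon)
= \sup_{\eta} \big\{ \eta \varepsilon - A(\eta) \big\}
= \mathrm{KL}(\nu \,\|\, \mu),
$$
where $\nu$ is the optimal measure satisfying $\int R \, d\nu = \varepsilon$.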
Although it may be hard in practice to find $A_*$ analytically,
it is an important conceptual tool.
We can give one more little twist to things by expressing
the exponent in $p$ through the Bregman divergence
generated by $A_*$.
Recall that in general the Bregman divergence generated by a
convex function $F$ is given by the formula below.
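For a differentiable convex function $F$, this is the standard definition
$$
D_F(x, y) = F(x) - F(y) - F'(y)\,(x - y) .
$$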
Using this definition,
together with the expression for $A_*(\varepsilon)$,
we can rewrite the exponent in $p$ at the optimum,
and hence the optimal density as a function of $R$,
as sketched below.
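A sketch of this computation, using the fact that $A_*'(\varepsilon) = \eta$ at the optimum:
$$
\eta R - A(\eta)
= A_*(\varepsilon) + \eta (R - \varepsilon)
= A_*(R) - D_{A_*}(R, \varepsilon),
\qquad\text{so}\qquad
p = \exp\big( A_*(R) - D_{A_*}(R, \varepsilon) \big).
$$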
This formula provides a nice interpretation of $p$:
it weighs rewarding experiences exponentially
using $A_*$ as a measure of goodness;
at the same time, it punishes deviation of $R$ from $\varepsilon$
using the Bregman divergence generated by $A_*$.
To get a more hands-on appreciation of the theory,
let’s take a look at a concrete example.
Let the base measure $\mu$ be Gaussian with mean $\mu$ and variance $\sigma^2$
(despite the letter $\mu$ being already taken,
we nevertheless denote the mean by $\mu$
in order to keep the standard notation
for the parameters of a Gaussian distribution;
no confusion should arise from such abuse of notation),
and let the reward be quadratic.
The log-partition function can then be computed in closed form,
as sketched below.
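For instance, assuming the quadratic reward $R(x) = -x^2/2$ (an illustrative choice; the original example may use a different quadratic), a direct Gaussian integral gives
$$
A(\eta)
= \log \int e^{\eta R(x)} \, \mathcal{N}(x; \mu, \sigma^2) \, dx
= -\tfrac{1}{2} \log\big( 1 + \eta \sigma^2 \big)
- \frac{\eta \mu^2}{2 \big( 1 + \eta \sigma^2 \big)},
\qquad \eta > -1/\sigma^2 .
$$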
From the reward constraint, we know that
$A'(\eta) = \varepsilon$,
which gives us an equation for $\eta$.
In this simple case, it is a quadratic equation
with only one root satisfying the improvement condition
$\varepsilon > \int R \, d\mu$, which guarantees that
the expectation of $R$ is greater under $\nu$ than under $\mu$.
Note that $\varepsilon$ must be non-positive because the reward itself is non-positive.
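Under the illustrative reward $R(x) = -x^2/2$ above, this works out as follows: writing $t = 1 + \eta \sigma^2$,
$$
A'(\eta)
= -\frac{\sigma^2}{2 t} - \frac{\mu^2}{2 t^2}
= \varepsilon
\quad\Longleftrightarrow\quad
2 \varepsilon t^2 + \sigma^2 t + \mu^2 = 0,
$$
a quadratic equation; for $\varepsilon < 0$ its two roots have opposite signs, so only the positive one ($t > 0$) is admissible.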
Substituting $\eta$ from the quadratic equation above
and $A$ from its closed form into Fenchel’s equality,
we find
$A_*(\varepsilon) = \eta \varepsilon - A(\eta)$,
where $\eta$ is a function of $\varepsilon$
defined by the condition $A'(\eta) = \varepsilon$.
Recall from the strong duality argument above that
$A_*(\varepsilon) = \mathrm{KL}(\nu \,\|\, \mu)$.
We can compute the right-hand side directly in order to confirm this equality.
First, using the exponential-family form of the solution, we can find $\nu$ explicitly; it is again Gaussian.
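Under the illustrative reward $R(x) = -x^2/2$, completing the square in the exponent $\eta R(x)$ against the Gaussian base density shows that
$$
\nu = \mathcal{N}\!\left( \frac{\mu}{1 + \eta \sigma^2}, \; \frac{\sigma^2}{1 + \eta \sigma^2} \right).
$$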
Then, using the formula for the KL divergence between Gaussians (recalled below),
we can confirm that $\mathrm{KL}(\nu \,\|\, \mu)$ indeed equals
$\eta \varepsilon - A(\eta)$, i.e., $A_*(\varepsilon)$ as computed above.
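For reference, the KL divergence between univariate Gaussians is
$$
\mathrm{KL}\big( \mathcal{N}(\mu_1, \sigma_1^2) \,\|\, \mathcal{N}(\mu_0, \sigma_0^2) \big)
= \log \frac{\sigma_0}{\sigma_1}
+ \frac{\sigma_1^2 + (\mu_1 - \mu_0)^2}{2 \sigma_0^2}
- \frac{1}{2} .
$$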
As a bonus, you may check that substituting
the admissible root $\eta(\varepsilon)$ of the quadratic equation into $\eta \varepsilon - A(\eta)$
yields an explicit expression
for $A_*$ as a function of $\varepsilon$.
It is a bit sad that even for such a simple example,
the convex conjugate of the log-partition function is so ugly.
Fortunately, the variational definition $A_*(\varepsilon) = \sup_\eta \{ \eta \varepsilon - A(\eta) \}$,
being a convex optimization problem, makes it possible to find
$A_*(\varepsilon)$ for any given $\varepsilon$ quickly and reliably,
thus eliminating the need for an explicit formula for $A_*$
for all practical purposes.
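As a small numeric sketch of that last point (assuming the illustrative reward $R(x) = -x^2/2$ and arbitrary example parameters; the helper names below are made up for this snippet), one can maximize $\eta \varepsilon - A(\eta)$ with an off-the-shelf scalar optimizer:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Base measure: Gaussian N(mu, sigma^2); reward assumed to be R(x) = -x^2 / 2
# (an illustrative choice of quadratic reward, with example parameter values).
mu, sigma = 1.0, 0.5

def log_partition(eta):
    """A(eta) = log E_mu[exp(eta * R)], in closed form for this example."""
    t = 1.0 + eta * sigma**2            # must stay positive for the integral to exist
    return -0.5 * np.log(t) - eta * mu**2 / (2.0 * t)

def conjugate(epsilon):
    """A_*(epsilon) = sup_eta {eta * epsilon - A(eta)}, found numerically."""
    res = minimize_scalar(
        lambda eta: -(eta * epsilon - log_partition(eta)),
        bounds=(-1.0 / sigma**2 + 1e-9, 1e4),
        method="bounded",
    )
    return -res.fun, res.x              # optimal value and the maximizing eta

# Ask for an expected reward slightly above the current E_mu[R] = -(sigma^2 + mu^2) / 2.
eps = -(sigma**2 + mu**2) / 2 + 0.1
kl, eta_opt = conjugate(eps)
print(f"A_*(eps) = KL(nu || mu) = {kl:.4f}, attained at eta = {eta_opt:.4f}")
```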