Fisher metric vs KL-divergence

Let $P$ and $Q$ be probability measures over a set $X$, and let $P$ be absolutely continuous with respect to $Q$. If $\mu$ is any measure on $X$ for which the densities $p = \frac{dP}{d\mu}$ and $q = \frac{dQ}{d\mu}$ exist, then the Kullback-Leibler divergence from $Q$ to $P$ is given as

$$
D_{KL}(P \,\|\, Q) = \int_X p \log\frac{p}{q}\, d\mu.
$$
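As a quick illustration (my addition, in JAX), here is a minimal sketch of this definition for a finite set $X$ with $\mu$ taken as the counting measure, so that the densities $p$ and $q$ are just probability vectors; the particular numbers are arbitrary.

```python
import jax.numpy as jnp

# Finite X = {1, 2, 3}, mu = counting measure, so the densities are probability vectors.
p = jnp.array([0.2, 0.5, 0.3])   # p = dP/dmu
q = jnp.array([0.1, 0.6, 0.3])   # q = dQ/dmu; P is absolutely continuous w.r.t. Q

# D_KL(P || Q) = sum_x p(x) log(p(x) / q(x))
d_kl = jnp.sum(p * jnp.log(p / q))
print(d_kl)   # roughly 0.047
```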

Let the density $q = q(x, \theta)$ be parameterized by a vector $\theta$ and let $p$ be a variation of $q$, i.e., $p = q + \delta q$, where $\delta q = \frac{\partial q}{\partial \theta_m}\, \delta\theta_m$. Then

$$
\begin{aligned}
D_{KL}(P \,\|\, Q) &= \int_X p \log\frac{p}{q}\, d\mu
= \int_X (q + \delta q) \log\frac{q + \delta q}{q}\, d\mu \\
&= \int_X q \log\left(1 + \frac{\delta q}{q}\right) d\mu
+ \int_X \delta q \log\left(1 + \frac{\delta q}{q}\right) d\mu \\
&\approx \int_X q \left(\frac{\delta q}{q} - \frac{(\delta q)^2}{2 q^2}\right) d\mu
+ \int_X \delta q\, \frac{\delta q}{q}\, d\mu \\
&= \int_X \delta q\, d\mu + \frac{1}{2}\int_X q\, \frac{(\delta q)^2}{q^2}\, d\mu \\
&= \delta\theta_m \frac{\partial}{\partial \theta_m}\int_X q\, d\mu
+ \frac{1}{2}\, \delta\theta_k\, \delta\theta_j \int_X q \left(\frac{1}{q}\frac{\partial q}{\partial \theta_k}\right)\!\left(\frac{1}{q}\frac{\partial q}{\partial \theta_j}\right) d\mu \\
&= 0 + \frac{1}{2}\, \delta\theta_k\, \delta\theta_j\, \mathrm{E}\left\{\frac{\partial \log q}{\partial \theta_k}\,\frac{\partial \log q}{\partial \theta_j}\right\}
= \frac{1}{2}\, \delta\theta_k\, \delta\theta_j\, g_{jk}(\theta),
\end{aligned}
$$

where we recognize $g_{jk}$, the Fisher information metric,

$$
g_{jk}(\theta) = \mathrm{E}\left\{\frac{\partial \log q}{\partial \theta_k}\,\frac{\partial \log q}{\partial \theta_j}\right\}.
$$
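Before moving on, here is a small Monte Carlo sanity check of this definition (my addition). It assumes a univariate Gaussian family $q(x; \theta) = \mathcal{N}(m, s^2)$ with $\theta = (m, s)$, for which the Fisher metric is known to be $\operatorname{diag}(1/s^2,\, 2/s^2)$.

```python
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

def log_q(theta, x):
    m, s = theta
    return norm.logpdf(x, loc=m, scale=s)

theta = jnp.array([0.0, 1.0])                                  # (m, s)
key = jax.random.PRNGKey(0)
xs = theta[0] + theta[1] * jax.random.normal(key, (100_000,))  # samples from q(., theta)

# Score d log q / d theta at each sample; the empirical mean of the
# outer product estimates g_jk(theta) = E{score_k score_j}.
scores = jax.vmap(lambda x: jax.grad(log_q)(theta, x))(xs)
g_mc = scores.T @ scores / xs.shape[0]
print(g_mc)   # close to [[1, 0], [0, 2]] = diag(1/s^2, 2/s^2)
```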

Thus, the Fisher information metric is the Hessian (the matrix of second derivatives) of the Kullback-Leibler divergence,

$$
g_{jk}(\theta_0) = \left. \frac{\partial^2}{\partial \theta_k\, \partial \theta_j} \right|_{\theta = \theta_0} D_{KL}\bigl(Q(\theta) \,\|\, Q(\theta_0)\bigr).
$$
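Both the quadratic approximation above and this Hessian formula are easy to check with automatic differentiation. The sketch below (my addition) again assumes the univariate Gaussian family $Q(\theta) = \mathcal{N}(m, s^2)$ with $\theta = (m, s)$ and uses its closed-form KL divergence.

```python
import jax
import jax.numpy as jnp

def kl_gauss(theta, theta0):
    """Closed-form D_KL(N(m, s^2) || N(m0, s0^2)) for theta = (m, s)."""
    m, s = theta
    m0, s0 = theta0
    return jnp.log(s0 / s) + (s**2 + (m - m0)**2) / (2 * s0**2) - 0.5

theta0 = jnp.array([0.0, 1.0])

# Hessian of theta -> D_KL(Q(theta) || Q(theta0)) evaluated at theta = theta0.
g = jax.hessian(lambda th: kl_gauss(th, theta0))(theta0)
print(g)                                   # ~[[1, 0], [0, 2]] = diag(1/s^2, 2/s^2)

# Second-order approximation D_KL ~ (1/2) dtheta_k dtheta_j g_jk for a small variation.
dtheta = jnp.array([1e-3, -2e-3])
print(kl_gauss(theta0 + dtheta, theta0))   # exact KL of the perturbed density
print(0.5 * dtheta @ g @ dtheta)           # quadratic form; agrees to leading order
```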

Bonus: one prominent equality for the Fisher information

Let’s prove the following useful equality:

$$
\mathrm{E}\left\{\frac{\partial \log p}{\partial \theta_k}\,\frac{\partial \log p}{\partial \theta_j}\right\}
= -\,\mathrm{E}\left\{\frac{\partial^2 \log p}{\partial \theta_k\, \partial \theta_j}\right\}.
$$

Consider the argument of the expectation on the right-hand side:

$$
\frac{\partial}{\partial \theta_k}\!\left(\frac{\partial \log p}{\partial \theta_j}\right)
= \frac{\partial}{\partial \theta_k}\!\left(\frac{1}{p}\frac{\partial p}{\partial \theta_j}\right)
= -\frac{1}{p^2}\frac{\partial p}{\partial \theta_k}\frac{\partial p}{\partial \theta_j}
+ \frac{1}{p}\frac{\partial^2 p}{\partial \theta_k\, \partial \theta_j}.
$$

Compute its expectation:

$$
\mathrm{E}\left\{\frac{\partial^2 \log p}{\partial \theta_k\, \partial \theta_j}\right\}
= -\int_X p\, \frac{\partial \log p}{\partial \theta_k}\,\frac{\partial \log p}{\partial \theta_j}\, d\mu
+ \frac{\partial^2}{\partial \theta_k\, \partial \theta_j}\int_X p\, d\mu.
$$

The second term on the right-hand side equals zero, since $\int_X p\, d\mu = 1$ for every $\theta$, which concludes the proof. Derivations in this post closely follow the book by Kullback.
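As a closing numerical check (my addition, not part of Kullback's derivation), the identity can be verified by Monte Carlo with automatic differentiation, once more for an assumed Gaussian family $\mathcal{N}(m, s^2)$ with $\theta = (m, s)$.

```python
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

def log_p(theta, x):
    m, s = theta
    return norm.logpdf(x, loc=m, scale=s)

theta = jnp.array([0.0, 1.0])
key = jax.random.PRNGKey(1)
xs = theta[0] + theta[1] * jax.random.normal(key, (100_000,))    # samples from p(., theta)

scores = jax.vmap(lambda x: jax.grad(log_p)(theta, x))(xs)       # d log p / d theta
hessians = jax.vmap(lambda x: jax.hessian(log_p)(theta, x))(xs)  # d^2 log p / d theta_k d theta_j

print(scores.T @ scores / xs.shape[0])   # left-hand side:  E{score_k score_j}
print(-hessians.mean(axis=0))            # right-hand side: -E{d^2 log p / d theta_k d theta_j}
```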