KL minimization vs maximum likelihood estimation

23 Apr 2017

Given samples $x_{1:N}$ from a distribution $\pi$, we consider the empirical distribution $\tilde{\pi}$ with density

$d\tilde{\pi} = \frac{1}{N} \sum_{i=1}^N \delta(x - x_i)\,dx$

as an approximation of the true distribution $\pi$; the more samples we have, the better the approximation.

To find a parametric distribution $\pi^\omega$ that best fits the samples, it is reasonable to minimize the KL divergence between the empirical distribution $\tilde{\pi}$ and our approximating parametric distribution $\pi^\omega$. Denote densities with respect to the Lebesgue measure by a subscript $\lambda$, i.e., $d\pi = \pi_\lambda \,d\lambda$. It is easy to see that minimization of the KL divergence

$D(\tilde{\pi} || \pi^\omega) = \int \log \tilde{\pi}_\lambda \,d\tilde{\pi} - \int \log \pi_\lambda^\omega \,d\tilde{\pi},$

is equivalent to maximization of the log-likelihood, because the first term does not depend on $\omega$:

$\arg\min_\omega D(\tilde{\pi} || \pi^\omega) = \arg\max_\omega \int \log \pi_\lambda^\omega \,d\tilde{\pi} = \arg\max_\omega \left( \frac{1}{N}\sum_{i=1}^N \log \pi_\lambda^\omega(x_i) \right).$
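As a small numerical illustration of this equivalence, the sketch below assumes a Gaussian family $\pi^\omega = \mathcal{N}(\mu, \sigma^2)$ with $\omega = (\mu, \sigma)$; this choice of family, the helper names `avg_log_likelihood` and `gaussian_mle`, and the synthetic data are all assumptions made for the example, not part of the argument above. It evaluates the sample average $\frac{1}{N}\sum_i \log \pi_\lambda^\omega(x_i)$ and checks that the closed-form maximum likelihood estimate scores higher than nearby parameter values, which is the same as saying it gives the smallest $\omega$-dependent part of the KL divergence.

```python
import math
import random


def avg_log_likelihood(samples, mu, sigma):
    """(1/N) * sum_i log N(x_i | mu, sigma^2).

    This is the integral of log pi^omega against the empirical distribution.
    """
    n = len(samples)
    return sum(
        -0.5 * math.log(2.0 * math.pi * sigma**2) - (x - mu) ** 2 / (2.0 * sigma**2)
        for x in samples
    ) / n


def gaussian_mle(samples):
    """Closed-form maximum likelihood estimate of (mu, sigma) for a Gaussian."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / n  # the 1/N variance is the MLE
    return mu, math.sqrt(var)


# Synthetic data (assumed for illustration): N(1.5, 0.7^2).
rng = random.Random(0)
samples = [rng.gauss(1.5, 0.7) for _ in range(2000)]

mu_hat, sigma_hat = gaussian_mle(samples)
best = avg_log_likelihood(samples, mu_hat, sigma_hat)

# Average log-likelihood at perturbed parameters; each should be below `best`,
# i.e. each has a larger omega-dependent part of D(pi_tilde || pi^omega).
perturbed = [
    avg_log_likelihood(samples, mu_hat + dm, sigma_hat * ds)
    for dm in (-0.2, 0.0, 0.2)
    for ds in (0.8, 1.0, 1.25)
    if not (dm == 0.0 and ds == 1.0)
]

print("MLE:", mu_hat, sigma_hat)
print("avg log-likelihood at MLE:", best)
print("best perturbed value:", max(perturbed))
```

The entropy-like term $\int \log \tilde{\pi}_\lambda \,d\tilde{\pi}$ never appears in the code: it is the same for every $\omega$, so it cannot change which parameters win.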

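We consider $D(\tilde{\pi} || \pi^\omega)$ and not the divergence in the opposite direction because, treating the delta functions as densities, the support of $\tilde{\pi}$ is contained in the support of $\pi^\omega$ (formally, $\tilde{\pi} \ll \pi^\omega$), but not the other way around: $\pi^\omega$ places mass away from the sample points, where $\tilde{\pi}$ has none, so $D(\pi^\omega || \tilde{\pi})$ is infinite for every $\omega$.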
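The asymmetry is easiest to see in a discrete analogue, sketched below under the assumption that both distributions live on five outcomes; the function `kl` and the two probability vectors are hypothetical and chosen only for illustration. The empirical histogram has unobserved outcomes with zero mass, while the model has full support, so the divergence is finite in one direction and infinite in the other.

```python
import math


def kl(p, q):
    """D(p || q) for discrete distributions, using the convention 0 * log(0 / q) = 0."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # outcomes with zero mass under p contribute nothing
        if q_i == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_i * math.log(p_i / q_i)
    return total


# Empirical histogram from 4 samples over 5 outcomes; outcomes 0 and 4 were never observed.
p_emp = [0.0, 0.25, 0.5, 0.25, 0.0]
# A model with full support (assumed values).
q_model = [0.1, 0.2, 0.4, 0.2, 0.1]

forward = kl(p_emp, q_model)  # D(empirical || model): finite
reverse = kl(q_model, p_emp)  # D(model || empirical): infinite

print("D(empirical || model) =", forward)
print("D(model || empirical) =", reverse)
```

Only the forward direction gives an objective that distinguishes between parameter values, which is why it is the one that leads to maximum likelihood.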

Boris Belousov
