# KL minimization vs maximum likelihood estimation

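Given samples $x_{1:N}$ from a distribution $\pi$, we consider the empirical distribution $\tilde{\pi}$ with (generalized) density with respect to the Lebesgue measure $\lambda$

$$
\tilde{\pi}_\lambda(x) = \frac{1}{N} \sum_{i=1}^{N} \delta(x - x_i),
$$

where $\delta$ is the Dirac delta, as an approximation of the true distribution $\pi$; the more samples we have, the better the approximation.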

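To find a parametric distribution $\pi^\omega$ that best fits the samples, it is reasonable to minimize the KL divergence between the empirical distribution $\tilde{\pi}$ and our approximating parametric distribution $\pi^\omega$. Denote densities with respect to the Lebesgue measure by a subscript $\lambda$, i.e., $d\pi = \pi_\lambda \,d\lambda$. Because $\tilde{\pi}$ puts mass $1/N$ on each sample, the cross term of the KL divergence reduces to a sample average, and the remaining term does not depend on $\omega$. Hence minimization over $\omega$ of the KL divergence

$$
D_{\mathrm{KL}}(\tilde{\pi} \,\|\, \pi^\omega) = \int \tilde{\pi}_\lambda \log \frac{\tilde{\pi}_\lambda}{\pi^\omega_\lambda} \,d\lambda = -\frac{1}{N} \sum_{i=1}^{N} \log \pi^\omega_\lambda(x_i) + \int \tilde{\pi}_\lambda \log \tilde{\pi}_\lambda \,d\lambda
$$

is equivalent to maximization of the log-likelihood

$$
\sum_{i=1}^{N} \log \pi^\omega_\lambda(x_i).
$$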

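We minimize $D_{\mathrm{KL}}(\tilde{\pi} \,\|\, \pi^\omega)$ rather than the divergence in the opposite direction, $D_{\mathrm{KL}}(\pi^\omega \,\|\, \tilde{\pi})$, because, at the formal level of this computation, $\tilde{\pi} \ll \pi^\omega$ (the sample points lie in the support of $\pi^\omega$) but not the other way around: $\pi^\omega$ puts mass where $\tilde{\pi}$ has none, so the reverse divergence is infinite for every $\omega$.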
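
The equivalence can be checked numerically. The sketch below is illustrative only: it assumes a Gaussian family $\pi^\omega = \mathcal{N}(\mu, \sigma^2)$ with $\omega = (\mu, \sigma)$, which the text does not prescribe, and the helper name `avg_neg_log_likelihood` is hypothetical. It evaluates the $\omega$-dependent part of the KL divergence (the average negative log-likelihood) on a grid and compares the grid minimizer with the closed-form Gaussian maximum likelihood estimate.

```python
import numpy as np


def avg_neg_log_likelihood(x, mu, sigma):
    """omega-dependent part of KL(empirical || N(mu, sigma^2)):
    -(1/N) * sum_i log p(x_i; mu, sigma)."""
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2)
                   + (x - mu) ** 2 / (2 * sigma**2))


# Samples from an assumed "true" distribution (illustrative values).
rng = np.random.default_rng(0)
x = rng.normal(loc=1.5, scale=2.0, size=1000)

# Closed-form Gaussian MLE: sample mean and (biased, ddof=0) sample std.
mu_mle = x.mean()
sigma_mle = x.std()

# Brute-force minimization of the KL objective over a parameter grid.
mus = np.linspace(0.0, 3.0, 151)      # grid spacing 0.02
sigmas = np.linspace(1.0, 3.0, 101)   # grid spacing 0.02
losses = np.array([[avg_neg_log_likelihood(x, m, s) for s in sigmas]
                   for m in mus])
i, j = np.unravel_index(np.argmin(losses), losses.shape)
mu_grid, sigma_grid = mus[i], sigmas[j]

print(f"MLE:            mu={mu_mle:.3f}, sigma={sigma_mle:.3f}")
print(f"KL grid search: mu={mu_grid:.3f}, sigma={sigma_grid:.3f}")
```

Up to the grid resolution, the two estimates should coincide, since both optimize the same function of $\omega$.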