The EM algorithm

This post follows the exposition of the EM algorithm from Section 9.4 of Bishop’s book, which I condensed a bit so that the full derivation fits on one page.

The likelihood $p(X | \theta)$ of the data $X$ given the parameters $\theta$ can be written as

$$
p(X | \theta) = \frac{p(X, Z | \theta)}{p(Z | X, \theta)} = \frac{p(X, Z | \theta)}{q(Z)} \, \frac{q(Z)}{p(Z | X, \theta)},
$$

where $p(X, Z | \theta)$ is the complete-data likelihood and $q(Z)$ is an arbitrary variational distribution over the hidden variables $Z$. Taking the logarithm of both sides, we obtain

$$
\ln p(X | \theta) = \ln \frac{p(X, Z | \theta)}{q(Z)} + \ln \frac{q(Z)}{p(Z | X, \theta)}.
$$

The left-hand side does not depend on $Z$, so it remains unchanged when we take the expectation of both sides with respect to $q(Z)$,

$$
\ln p(X | \theta) = \underbrace{\sum_Z q(Z) \ln \frac{p(X, Z | \theta)}{q(Z)}}_{\mathcal{L}(q,\, \theta)} + \underbrace{\sum_Z q(Z) \ln \frac{q(Z)}{p(Z | X, \theta)}}_{\mathrm{KL}(q \,\|\, p)}.
$$

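As a quick sanity check, this decomposition can be verified numerically. The sketch below uses a toy two-component mixture of my own (a single observation $x$, mixing weights `pi`, unit-variance Gaussian components with means `mu` — none of these come from the post) and confirms that $\ln p(X | \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}(q\|p)$ for an arbitrary choice of $q(Z)$.

```python
import numpy as np

# Toy two-component mixture (hypothetical, for illustration only):
# one observation x, mixing weights pi, unit-variance Gaussians with means mu.
x = 1.3
pi = np.array([0.4, 0.6])           # p(Z = k)
mu = np.array([-1.0, 2.0])          # component means

def gauss(x, m):
    """Density of N(m, 1) evaluated at x."""
    return np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2.0 * np.pi)

joint = pi * gauss(x, mu)           # p(x, Z = k | theta), complete-data likelihood
marginal = joint.sum()              # p(x | theta)
posterior = joint / marginal        # p(Z = k | x, theta)

q = np.array([0.7, 0.3])            # an arbitrary variational distribution q(Z)

L = np.sum(q * np.log(joint / q))        # lower bound L(q, theta)
KL = np.sum(q * np.log(q / posterior))   # KL(q || p)

# The decomposition: ln p(x | theta) = L(q, theta) + KL(q || p)
assert np.isclose(np.log(marginal), L + KL)
```
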
Since the KL divergence is non-negative, the first term $\mathcal{L}(q, \theta)$ is a lower bound on the log-likelihood. For a fixed value of the parameters $\theta = \theta^{\mathrm{old}}$, the left-hand side, and hence the sum of the two terms on the right, is constant with respect to $q$. Using the freedom in the choice of $q$, we can therefore maximize the lower bound $\mathcal{L}(q, \theta^\mathrm{old})$ by minimizing the KL divergence $\mathrm{KL}(q\|p)$ with respect to $q$. The minimum is attained at $q(Z) = p(Z | X, \theta^{\mathrm{old}})$, where the KL divergence vanishes; this corresponds to the E step of the EM algorithm. If after that we maximize $\mathcal{L}(q, \theta)$ with respect to $\theta$ while keeping $q$ fixed, we obtain a new parameter value $\theta^\mathrm{new}$ for which the lower bound improves, $\mathcal{L}^{\mathrm{new}} \geq \mathcal{L}^{\mathrm{old}}$. Since the KL divergence was zero at $\theta^{\mathrm{old}}$ and is non-negative at $\theta^{\mathrm{new}}$, the log-likelihood improves too: $\ln p(X | \theta^{\mathrm{new}}) \geq \mathcal{L}^{\mathrm{new}} \geq \mathcal{L}^{\mathrm{old}} = \ln p(X | \theta^{\mathrm{old}})$. Maximization of the lower bound $\mathcal{L}(q, \theta)$ with respect to the parameters $\theta$ constitutes the M step of the EM algorithm.
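
To make the two steps concrete, here is a minimal EM loop, assuming (for illustration, not taken from the post) a one-dimensional Gaussian mixture with known unit variances, where both steps have closed forms. The E step computes the responsibilities $p(Z | X, \theta^{\mathrm{old}})$, and the M step maximizes $\mathcal{L}(q, \theta)$ by re-estimating the mixing weights and means; the function name `em_gmm_1d` and the synthetic data are my own.

```python
import numpy as np

def em_gmm_1d(X, K, n_iter=100, seed=0):
    """EM for a 1-D Gaussian mixture with fixed unit variances (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    pi = np.full(K, 1.0 / K)                    # mixing weights
    mu = rng.choice(X, size=K, replace=False)   # initialize means from the data
    log_lik = []

    for _ in range(n_iter):
        # E step: responsibilities r[n, k] = p(z_n = k | x_n, theta_old),
        # i.e. q(Z) is set to the posterior over the hidden variables.
        log_joint = (np.log(pi)
                     - 0.5 * (X[:, None] - mu) ** 2
                     - 0.5 * np.log(2.0 * np.pi))       # ln p(x_n, z_n = k | theta_old)
        log_marg = np.logaddexp.reduce(log_joint, axis=1)
        r = np.exp(log_joint - log_marg[:, None])

        log_lik.append(log_marg.sum())                  # ln p(X | theta_old)

        # M step: maximize L(q, theta) with q fixed; for this model the
        # update is a weighted re-estimation of the weights and means.
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r * X[:, None]).sum(axis=0) / Nk

    return pi, mu, log_lik

# Example run on synthetic data from two well-separated components.
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])
pi, mu, log_lik = em_gmm_1d(X, K=2)
assert all(b >= a - 1e-9 for a, b in zip(log_lik, log_lik[1:]))  # never decreases
```

Tracking the log-likelihood at every iteration, as done here, doubles as a convergence check: the argument above guarantees it never decreases, so a drop almost always signals a bug.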