KL between trajectory distributions vs KL between policies

The derivation presented here is inspired by these lecture notes.

Given a trajectory $\tau = (s_0, a_0, s_1, a_1, \dots, s_{T-1}, a_{T-1})$ and two distributions over trajectories, $p_\pi$ and $p_q$, parametrized by policies $\pi$ and $q$ respectively,

$$p_\pi(\tau) = \mu_0(s_0) \prod_{t=0}^{T-1} \pi_t(a_t | s_t) \prod_{t=0}^{T-2} p(s_{t+1} | s_t, a_t), \qquad p_q(\tau) = \mu_0(s_0) \prod_{t=0}^{T-1} q_t(a_t | s_t) \prod_{t=0}^{T-2} p(s_{t+1} | s_t, a_t),$$

where $\mu_0 = \mu_0(s_0)$ is the initial state distribution, $\pi_t = \pi_t(a_t | s_t)$ and $q_t = q_t(a_t | s_t)$ are the policies at time $t$, and $p_t = p(s_{t+1} | s_t, a_t)$ is the system dynamics, we can find the KL divergence between $p_\pi$ and $p_q$ as follows. By definition,

$$\mathrm{KL}(p_\pi \,\|\, p_q) = \mathbb{E}_{\tau \sim p_\pi}\left[ \log \frac{p_\pi(\tau)}{p_q(\tau)} \right] = \sum_\tau p_\pi(\tau) \log \frac{p_\pi(\tau)}{p_q(\tau)}.$$
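As a quick sanity check, in the single-step case $T = 1$ the trajectory is just $(s_0, a_0)$, the dynamics do not enter, and the definition immediately gives a state-averaged KL between the two policies:

$$\mathrm{KL}(p_\pi \,\|\, p_q) = \sum_{s_0, a_0} \mu_0(s_0)\, \pi_0(a_0 | s_0) \log \frac{\pi_0(a_0 | s_0)}{q_0(a_0 | s_0)} = \mathbb{E}_{s_0 \sim \mu_0} \Big[ \mathrm{KL}\big( \pi_0(\cdot | s_0) \,\|\, q_0(\cdot | s_0) \big) \Big].$$

The derivation below extends this pattern to every time step.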

Denote $\tau_t = (s_0, a_0, \dots, s_{t-1}, a_{t-1})$. At time $t$, we have a state-action distribution $\rho^\pi_t = \rho^\pi_t(s_t, a_t)$ that can be computed by marginalizing out the prefix $\tau_t$:

$$\rho^\pi_t(s_t, a_t) = \sum_{\tau_t} p_\pi(\tau_t, s_t, a_t) = \sum_{\tau_t} \mu_0(s_0) \left[ \prod_{k=0}^{t-1} \pi_k(a_k | s_k)\, p(s_{k+1} | s_k, a_k) \right] \pi_t(a_t | s_t).$$

Notice that it decomposes into the product $\rho^\pi_t(s_t, a_t) = \mu^\pi_t(s_t) \pi_t(a_t | s_t)$.
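Here $\mu^\pi_t$ is the marginal state distribution at time $t$ under $\pi$; written out explicitly,

$$\mu^\pi_t(s_t) = \sum_{\tau_t} \mu_0(s_0) \prod_{k=0}^{t-1} \pi_k(a_k | s_k)\, p(s_{k+1} | s_k, a_k).$$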

Coming back to the KL, observe that the ratio $p_\pi / p_q$ has a lot of common terms (the initial state distribution and all dynamics factors) that cancel out, leading to

$$\mathrm{KL}(p_\pi \,\|\, p_q) = \sum_\tau p_\pi(\tau) \log \frac{\mu_0 \prod_{t=0}^{T-1} \pi_t \prod_{t=0}^{T-2} p_t}{\mu_0 \prod_{t=0}^{T-1} q_t \prod_{t=0}^{T-2} p_t} = \sum_\tau p_\pi(\tau) \sum_{t=0}^{T-1} \log \frac{\pi_t(a_t | s_t)}{q_t(a_t | s_t)}.$$

Interchanging the order of summation, and noting that for each fixed $t$ the summand depends on $\tau$ only through $(s_t, a_t)$ (so summing $p_\pi(\tau)$ over the remaining variables produces exactly $\rho^\pi_t(s_t, a_t)$), we arrive at

$$\mathrm{KL}(p_\pi \,\|\, p_q) = \sum_{t=0}^{T-1} \sum_\tau p_\pi(\tau) \log \frac{\pi_t(a_t | s_t)}{q_t(a_t | s_t)} = \sum_{t=0}^{T-1} \sum_{s_t, a_t} \rho^\pi_t(s_t, a_t) \log \frac{\pi_t(a_t | s_t)}{q_t(a_t | s_t)}.$$

So, finally, the KL divergence between trajectory distributions is the sum over time of state-averaged KL divergences between the policies at each time step,

$$\mathrm{KL}(p_\pi \,\|\, p_q) = \sum_{t=0}^{T-1} \mathbb{E}_{s_t \sim \mu^\pi_t} \Big[ \mathrm{KL}\big( \pi_t(\cdot | s_t) \,\|\, q_t(\cdot | s_t) \big) \Big].$$
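To make the identity concrete, here is a minimal numerical sanity check on a toy finite MDP (the state and action counts, horizon, random dynamics, and random policies below are arbitrary illustrative choices, not anything defined above). It computes the left-hand side by enumerating all trajectories and the right-hand side from forward-propagated state marginals; the two numbers should agree up to floating-point error.

```python
# Toy finite MDP: illustrative sizes and random parameters, chosen only for this check.
import itertools
import numpy as np

rng = np.random.default_rng(0)
S, A, T = 3, 2, 4  # number of states, number of actions, horizon


def normalize(x):
    return x / x.sum(axis=-1, keepdims=True)


mu0 = normalize(rng.random(S))                           # initial state distribution mu_0(s_0)
P = normalize(rng.random((S, A, S)))                     # dynamics p(s_{t+1} | s_t, a_t)
pi = [normalize(rng.random((S, A))) for _ in range(T)]   # policies pi_t(a_t | s_t)
q = [normalize(rng.random((S, A))) for _ in range(T)]    # policies q_t(a_t | s_t)


def traj_prob(traj, policy):
    """Probability of traj = ((s_0, a_0), ..., (s_{T-1}, a_{T-1})) under the given policy."""
    prob = mu0[traj[0][0]]
    for t, (s, a) in enumerate(traj):
        prob *= policy[t][s, a]
        if t + 1 < T:                        # dynamics factor p(s_{t+1} | s_t, a_t)
            prob *= P[s, a, traj[t + 1][0]]
    return prob


# Left-hand side: KL between trajectory distributions, by brute-force enumeration.
kl_traj = 0.0
for traj in itertools.product(itertools.product(range(S), range(A)), repeat=T):
    p = traj_prob(traj, pi)
    kl_traj += p * np.log(p / traj_prob(traj, q))

# Right-hand side: sum over time of state-averaged KL(pi_t(.|s) || q_t(.|s)),
# with the state marginals mu_t^pi propagated forward under pi.
kl_policies = 0.0
mu = mu0.copy()
for t in range(T):
    kl_per_state = (pi[t] * np.log(pi[t] / q[t])).sum(axis=1)  # KL(pi_t(.|s) || q_t(.|s))
    kl_policies += (mu * kl_per_state).sum()
    mu = np.einsum("s,sa,sak->k", mu, pi[t], P)                # mu_{t+1}^pi(s')

print(kl_traj, kl_policies)  # the two values match up to float error
```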