Fisher metric vs KL-divergence

Let $P$ and $Q$ be probability measures over a set $X$, and let $P$ be absolutely continuous with respect to $Q$. If $\mu$ is any measure on $X$ for which the densities $p = \frac{dP}{d\mu}$ and $q = \frac{dQ}{d\mu}$ exist, then the Kullback-Leibler divergence from $Q$ to $P$ is given as

$$
D_{KL}(P \,\|\, Q) = \int_X p \log\frac{p}{q}\, d\mu.
$$
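As a quick illustration (my addition, in JAX), here is a minimal sketch of this definition for a finite set $X$ with $\mu$ taken as the counting measure, so that the densities $p$ and $q$ are just probability vectors; the particular numbers are arbitrary.

```python
import jax.numpy as jnp

# Finite X = {1, 2, 3}, mu = counting measure, so the densities are probability vectors.
p = jnp.array([0.2, 0.5, 0.3])   # p = dP/dmu
q = jnp.array([0.1, 0.6, 0.3])   # q = dQ/dmu; P is absolutely continuous w.r.t. Q

# D_KL(P || Q) = sum_x p(x) log(p(x) / q(x))
d_kl = jnp.sum(p * jnp.log(p / q))
print(d_kl)   # roughly 0.047
```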

Let the density $q = q(x, \theta)$ be parameterized by a vector $\theta$ and let $p$ be a variation of $q$, i.e., $p = q + \delta q$, where $\delta q = \frac{\partial q}{\partial \theta_m}\, \delta\theta_m$. Then

$$
\begin{aligned}
D_{KL}(P \,\|\, Q) &= \int_X p \log\frac{p}{q}\, d\mu
= \int_X (q + \delta q) \log\frac{q + \delta q}{q}\, d\mu \\
&= \int_X q \log\left(1 + \frac{\delta q}{q}\right) d\mu
+ \int_X \delta q \log\left(1 + \frac{\delta q}{q}\right) d\mu \\
&\approx \int_X q \left(\frac{\delta q}{q} - \frac{(\delta q)^2}{2 q^2}\right) d\mu
+ \int_X \delta q\, \frac{\delta q}{q}\, d\mu \\
&= \int_X \delta q\, d\mu + \frac{1}{2}\int_X q\, \frac{(\delta q)^2}{q^2}\, d\mu \\
&= \delta\theta_m \frac{\partial}{\partial \theta_m}\int_X q\, d\mu
+ \frac{1}{2}\, \delta\theta_k\, \delta\theta_j \int_X q \left(\frac{1}{q}\frac{\partial q}{\partial \theta_k}\right)\!\left(\frac{1}{q}\frac{\partial q}{\partial \theta_j}\right) d\mu \\
&= 0 + \frac{1}{2}\, \delta\theta_k\, \delta\theta_j\, \mathrm{E}\left\{\frac{\partial \log q}{\partial \theta_k}\,\frac{\partial \log q}{\partial \theta_j}\right\}
= \frac{1}{2}\, \delta\theta_k\, \delta\theta_j\, g_{jk}(\theta),
\end{aligned}
$$

where we recognize $g_{jk}$, the Fisher information metric,

$$
g_{jk}(\theta) = \mathrm{E}\left\{\frac{\partial \log q}{\partial \theta_k}\,\frac{\partial \log q}{\partial \theta_j}\right\}.
$$
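Before moving on, here is a small Monte Carlo sanity check of this definition (my addition). It assumes a univariate Gaussian family $q(x; \theta) = \mathcal{N}(m, s^2)$ with $\theta = (m, s)$, for which the Fisher metric is known to be $\operatorname{diag}(1/s^2,\, 2/s^2)$.

```python
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

def log_q(theta, x):
    m, s = theta
    return norm.logpdf(x, loc=m, scale=s)

theta = jnp.array([0.0, 1.0])                                  # (m, s)
key = jax.random.PRNGKey(0)
xs = theta[0] + theta[1] * jax.random.normal(key, (100_000,))  # samples from q(., theta)

# Score d log q / d theta at each sample; the empirical mean of the
# outer product estimates g_jk(theta) = E{score_k score_j}.
scores = jax.vmap(lambda x: jax.grad(log_q)(theta, x))(xs)
g_mc = scores.T @ scores / xs.shape[0]
print(g_mc)   # close to [[1, 0], [0, 2]] = diag(1/s^2, 2/s^2)
```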

Thus, the Fisher information metric is the Hessian (the matrix of second derivatives) of the Kullback-Leibler divergence,

$$
g_{jk}(\theta_0) = \left. \frac{\partial^2}{\partial \theta_k\, \partial \theta_j} \right|_{\theta = \theta_0} D_{KL}\bigl(Q(\theta) \,\|\, Q(\theta_0)\bigr).
$$
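Both the quadratic approximation above and this Hessian formula are easy to check with automatic differentiation. The sketch below (my addition) again assumes the univariate Gaussian family $Q(\theta) = \mathcal{N}(m, s^2)$ with $\theta = (m, s)$ and uses its closed-form KL divergence.

```python
import jax
import jax.numpy as jnp

def kl_gauss(theta, theta0):
    """Closed-form D_KL(N(m, s^2) || N(m0, s0^2)) for theta = (m, s)."""
    m, s = theta
    m0, s0 = theta0
    return jnp.log(s0 / s) + (s**2 + (m - m0)**2) / (2 * s0**2) - 0.5

theta0 = jnp.array([0.0, 1.0])

# Hessian of theta -> D_KL(Q(theta) || Q(theta0)) evaluated at theta = theta0.
g = jax.hessian(lambda th: kl_gauss(th, theta0))(theta0)
print(g)                                   # ~[[1, 0], [0, 2]] = diag(1/s^2, 2/s^2)

# Second-order approximation D_KL ~ (1/2) dtheta_k dtheta_j g_jk for a small variation.
dtheta = jnp.array([1e-3, -2e-3])
print(kl_gauss(theta0 + dtheta, theta0))   # exact KL of the perturbed density
print(0.5 * dtheta @ g @ dtheta)           # quadratic form; agrees to leading order
```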

Bonus: one prominent equality for the Fisher information

Let’s prove the following useful equality:

$$
\mathrm{E}\left\{\frac{\partial \log p}{\partial \theta_k}\,\frac{\partial \log p}{\partial \theta_j}\right\}
= -\,\mathrm{E}\left\{\frac{\partial^2 \log p}{\partial \theta_k\, \partial \theta_j}\right\}.
$$

Consider the argument of the expectation on the right-hand side:

$$
\frac{\partial}{\partial \theta_k}\!\left(\frac{\partial \log p}{\partial \theta_j}\right)
= \frac{\partial}{\partial \theta_k}\!\left(\frac{1}{p}\frac{\partial p}{\partial \theta_j}\right)
= -\frac{1}{p^2}\frac{\partial p}{\partial \theta_k}\frac{\partial p}{\partial \theta_j}
+ \frac{1}{p}\frac{\partial^2 p}{\partial \theta_k\, \partial \theta_j}.
$$

Compute its expectation:

$$
\mathrm{E}\left\{\frac{\partial^2 \log p}{\partial \theta_k\, \partial \theta_j}\right\}
= -\int_X p\, \frac{\partial \log p}{\partial \theta_k}\,\frac{\partial \log p}{\partial \theta_j}\, d\mu
+ \frac{\partial^2}{\partial \theta_k\, \partial \theta_j}\int_X p\, d\mu.
$$

The second term on the right-hand side equals zero, since $\int_X p\, d\mu = 1$ for every $\theta$, which concludes the proof. Derivations in this post closely follow the book by Kullback.
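As a closing numerical check (my addition, not part of Kullback's derivation), the identity can be verified by Monte Carlo with automatic differentiation, once more for an assumed Gaussian family $\mathcal{N}(m, s^2)$ with $\theta = (m, s)$.

```python
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

def log_p(theta, x):
    m, s = theta
    return norm.logpdf(x, loc=m, scale=s)

theta = jnp.array([0.0, 1.0])
key = jax.random.PRNGKey(1)
xs = theta[0] + theta[1] * jax.random.normal(key, (100_000,))    # samples from p(., theta)

scores = jax.vmap(lambda x: jax.grad(log_p)(theta, x))(xs)       # d log p / d theta
hessians = jax.vmap(lambda x: jax.hessian(log_p)(theta, x))(xs)  # d^2 log p / d theta_k d theta_j

print(scores.T @ scores / xs.shape[0])   # left-hand side:  E{score_k score_j}
print(-hessians.mean(axis=0))            # right-hand side: -E{d^2 log p / d theta_k d theta_j}
```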