This article is a brief summary of some relationships between the log-likelihood, score, Kullback-Leibler divergence and Fisher information. No explanations, just pure math.

Log-likelihood

The log-likelihood is defined as the logarithm of the likelihood, \(\log p(x|\theta)\).

Let’s perform a Taylor approximation of the log-likelihood \(\log p(x|\theta')\) around the current estimate \(\theta\):

Scalar:

$$\begin{align*} \log p(x|\theta') =& \log p(x|\theta')|_{\theta' = \theta} + \sum_i \left. \frac{\partial \log p(x|\theta')}{\partial \theta'_i} \right| _{\theta' = \theta} (\theta'_i - \theta_i) \\ &+ \frac{1}{2}\sum_i\sum_j \left. \frac{\partial^2 \log p(x|\theta')}{\partial \theta'_i\partial \theta'_j}\right| _{\theta' = \theta} (\theta'_i - \theta_i)(\theta'_j - \theta_j) + ... \end{align*}$$

Vector:

$$\begin{align*} \log p(x|\theta') =& \log p(x|\theta')|_{\theta' = \theta} + \left. \nabla_{\theta'} \log p(x|\theta')^T\right| _{\theta' = \theta} (\theta' - \theta) \\ &+ \frac{1}{2}\left. (\theta' - \theta)^T\nabla^2_{\theta'} \log p(x|\theta')\right| _{\theta' = \theta}(\theta' - \theta) + ... \end{align*}$$

The linear term in this decomposition, \(\frac{\partial}{\partial \theta'_i} \log p(x|\theta')\) in the scalar case or \(\nabla_{\theta'} \log p(x|\theta')\) in the vector case, can be written as

Scalar:

$$ \frac{\partial}{\partial \theta'_i} \log p(x|\theta') = \frac{1}{p(x|\theta')} \frac{\partial}{\partial \theta'_i} p(x|\theta') $$

Vector:

$$ \nabla_{\theta'} \log p(x|\theta') = \frac{1}{p(x|\theta')}\nabla_{\theta'} p(x|\theta') $$

by using the log derivative trick.

Evaluated at \( \theta' = \theta \), we obtain

Scalar:

$$ \left. \frac{\partial}{\partial \theta'_i} \log p(x|\theta') \right|_{\theta' = \theta} = \frac{1}{p(x|\theta)} \frac{\partial}{\partial \theta_i} p(x|\theta). $$

Vector:

$$ \left. \nabla_{\theta'} \log p(x|\theta') \right|_{\theta' = \theta} = \frac{1}{p(x|\theta)}\nabla_{\theta} p(x|\theta). $$

The quadratic term of the decomposition, \(\frac{\partial^2 \log p(x|\theta')}{\partial \theta'_i\partial \theta'_j}\) in the scalar case or \(\nabla^2_{\theta'} \log p(x|\theta')\) in the vector case, can be written as

Scalar:

$$ \begin{align*} \frac{\partial^2 \log p(x|\theta')}{\partial \theta'_i\partial \theta'_j} =& \frac{\partial }{\partial \theta'_j} \left( \frac{\partial}{\partial \theta'_i} \log p(x|\theta')\right) \\ =& \frac{\partial }{\partial \theta'_j} \left( \frac{1}{p(x|\theta')} \frac{\partial}{\partial \theta'_i} p(x|\theta')\right) \\ =& \frac{\partial }{\partial \theta'_j} \left( \frac{1}{p(x|\theta')}\right) \frac{\partial}{\partial \theta'_i} p(x|\theta') + \frac{1}{p(x|\theta')} \frac{\partial }{\partial \theta'_j} \left(\frac{\partial}{\partial \theta'_i} p(x|\theta')\right)\\ =& \frac{\partial }{\partial \theta'_j} \left( \frac{1}{p(x|\theta')}\right) \frac{\partial}{\partial \theta'_i} p(x|\theta') + \frac{1}{p(x|\theta')} \frac{\partial^2 p(x|\theta')}{\partial \theta'_i\partial \theta'_j} \\ =& - \frac{1}{p(x|\theta')^2} \frac{\partial}{\partial \theta'_j} p(x|\theta') \frac{\partial}{\partial \theta'_i} p(x|\theta') + \frac{1}{p(x|\theta')} \frac{\partial^2 p(x|\theta')}{\partial \theta'_i\partial \theta'_j} \\ =& - \frac{\partial}{\partial \theta'_j} \log p(x|\theta') \frac{\partial}{\partial \theta'_i} \log p(x|\theta') + \frac{1}{p(x|\theta')} \frac{\partial^2 p(x|\theta')}{\partial \theta'_i\partial \theta'_j} \end{align*} $$

Vector:

$$ \begin{align*} \nabla^2_{\theta'} \log p(x|\theta') =& \nabla_{\theta'} \nabla_{\theta'}^T \log p(x|\theta') \\ =& \nabla_{\theta'} \left(\frac{1}{p(x|\theta')}\nabla_{\theta'}^T p(x|\theta')\right) \\ =& \nabla_{\theta'} \left(\frac{1}{p(x|\theta')}\right) \nabla_{\theta'}^T p(x|\theta') + \frac{1}{p(x|\theta')}\nabla_{\theta'}\nabla_{\theta'}^T p(x|\theta') \\ =&- \frac{1}{p(x|\theta')^2} \nabla_{\theta'} p(x|\theta') \nabla_{\theta'} p(x|\theta') ^T + \frac{1}{p(x|\theta')}\nabla_{\theta'}^2 p(x|\theta') \\ =&- \nabla_{\theta'} \log p(x|\theta') \nabla_{\theta'} \log p(x|\theta') ^T + \frac{1}{p(x|\theta')}\nabla_{\theta'}^2 p(x|\theta') \end{align*} $$

Evaluated at \( \theta' = \theta \), we obtain

Scalar:

$$ \begin{align*} \left. \frac{\partial^2 \log p(x|\theta')}{\partial \theta'_i\partial \theta'_j}\right|_{\theta' = \theta} =& - \frac{\partial}{\partial \theta_j} \log p(x|\theta) \frac{\partial}{\partial \theta_i} \log p(x|\theta) + \frac{1}{p(x|\theta)} \frac{\partial^2 p(x|\theta)}{\partial \theta_i\partial \theta_j} \end{align*}. $$

Vector:

$$ \begin{align*} \left. \nabla^2_{\theta'} \log p(x|\theta')\right|_{\theta' = \theta} =&- \nabla_{\theta} \log p(x|\theta) \nabla_{\theta} \log p(x|\theta) ^T + \frac{1}{p(x|\theta)}\nabla_{\theta}^2 p(x|\theta) \end{align*}. $$

Finally, you can express the Taylor approximation as

Scalar:

$$\begin{align*} \log p(x|\theta') =& \log p(x|\theta) + \sum_i \frac{1}{p(x|\theta)} \frac{\partial}{\partial \theta_i} p(x|\theta) (\theta'_i - \theta_i) \\ &+ \frac{1}{2} \sum_i\sum_j \left(- \frac{\partial}{\partial \theta_i} \log p(x|\theta) \frac{\partial}{\partial \theta_j} \log p(x|\theta) + \frac{1}{p(x|\theta)}\frac{\partial^2 p(x|\theta)}{\partial \theta_i\partial \theta_j}\right) (\theta'_i - \theta_i)(\theta'_j - \theta_j) + ... \end{align*}$$

Vector:

$$\begin{align*} \log p(x|\theta') =& \log p(x|\theta) + \frac{1}{p(x|\theta)}\nabla_{\theta} p(x|\theta)^T (\theta' - \theta) \\ &+ \frac{1}{2} (\theta' - \theta)^T\left(- \nabla_{\theta} \log p(x|\theta) \nabla_{\theta} \log p(x|\theta)^T + \frac{1}{p(x|\theta)}\nabla_{\theta}^2 p(x|\theta)\right)(\theta' - \theta) + ... \end{align*}$$
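As a quick numerical sketch (not part of the derivation), assuming a univariate Gaussian with known variance so that \(\theta\) is the mean: the log-likelihood is exactly quadratic in the mean, so the second-order expansion reproduces it exactly.

```python
# Sketch: second-order Taylor expansion of a Gaussian log-likelihood in the
# mean parameter. For this model the expansion is exact (no higher terms).
import numpy as np

def log_lik(x, mu, sigma=1.0):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

x, mu, sigma = 0.3, 1.0, 1.0
grad = (x - mu) / sigma**2      # d/dmu log p(x|mu)
hess = -1.0 / sigma**2          # d^2/dmu^2 log p(x|mu)

for mu_prime in [1.1, 1.5, 2.0]:
    exact = log_lik(x, mu_prime, sigma)
    taylor = (log_lik(x, mu, sigma) + grad * (mu_prime - mu)
              + 0.5 * hess * (mu_prime - mu)**2)
    print(mu_prime, exact, taylor)   # exact and Taylor agree
```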


Score

The score is defined as the derivative of the log-likelihood with respect to the parameters

Scalar:

$$ V_i(x;\theta') = \frac{\partial}{\partial \theta'_i} \log p(x|\theta') $$

Vector:

$$ V(x;\theta') = \nabla_{\theta'} \log p(x|\theta') $$

Mean

Scalar:

$$ \mathbb{E}_{p(x|\theta^*)}\left[\frac{\partial}{\partial \theta'_i} \log p(x|\theta')\right] = \int \limits_{-\infty}^{\infty} p(x|\theta^*) \frac{\partial}{\partial \theta'_i} \log p(x|\theta')dx$$

Vector:

$$ \mathbb{E}_{p(x|\theta^*)}\left[\nabla_{\theta'} \log p(x|\theta')\right] = \int \limits_{-\infty}^{\infty} p(x|\theta^*) \nabla_{\theta'} \log p(x|\theta')dx$$

Variance

Scalar:

$$ \begin{align*} \text{Var}\left(\frac{\partial}{\partial \theta'_i} \log p(x|\theta')\right) &= \mathbb{E}_{p(x|\theta^*)}\left[\left(\frac{\partial}{\partial \theta'_i} \log p(x|\theta')-\mathbb{E}_{p(x|\theta^*)}\left[\frac{\partial}{\partial \theta'_i} \log p(x|\theta')\right]\right)^2\right] \\ &= \mathbb{E}_{p(x|\theta^*)}\left[\frac{\partial}{\partial \theta'_i} \log p(x|\theta')\frac{\partial}{\partial \theta'_i} \log p(x|\theta')\right] - \mathbb{E}_{p(x|\theta^*)}\left[\frac{\partial}{\partial \theta'_i} \log p(x|\theta')\right]^2 \\ \end{align*} $$

Vector:

$$ \begin{align*} \text{Var}\left(\nabla_{\theta'} \log p(x|\theta')\right) &= \mathbb{E}_{p(x|\theta^*)}\left[\left(\nabla_{\theta'}\log p(x|\theta')-\mathbb{E}_{p(x|\theta^*)}\left[\nabla_{\theta'} \log p(x|\theta')\right]\right)\left(\nabla_{\theta'}\log p(x|\theta')-\mathbb{E}_{p(x|\theta^*)}\left[\nabla_{\theta'} \log p(x|\theta')\right]\right)^T\right] \\ &= \mathbb{E}_{p(x|\theta^*)}\left[\nabla_{\theta'} \log p(x|\theta')\nabla_{\theta'} \log p(x|\theta')^T\right] - \mathbb{E}_{p(x|\theta^*)}\left[\nabla_{\theta'} \log p(x|\theta')\right]\mathbb{E}_{p(x|\theta^*)}\left[\nabla_{\theta'} \log p(x|\theta')\right]^T \\ \end{align*} $$
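A Monte Carlo sketch of these two moments, assuming a Gaussian with unknown mean and the score evaluated at \(\theta' = \theta^*\): the mean is approximately zero and the variance matches \(1/\sigma^2\).

```python
# Sketch: sample from p(x|theta*), evaluate the score at theta' = theta*,
# and estimate its mean and variance.
import numpy as np

rng = np.random.default_rng(0)
theta_star, sigma, n = 1.0, 2.0, 1_000_000
x = rng.normal(theta_star, sigma, size=n)

score = (x - theta_star) / sigma**2   # d/dtheta log p(x|theta) at theta*
print(score.mean())                   # ~ 0
print(score.var(), 1 / sigma**2)      # ~ 1/sigma^2
```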

Kullback-Leibler divergence

The Kullback-Leibler divergence is defined as

$$ D_\textrm{KL}(\theta||\theta') = \int \limits_{-\infty}^{\infty} p(x|\theta) \log \frac{p(x|\theta)}{p(x|\theta')} dx. $$

Let’s perform a Taylor approximation around the current estimate \(\theta\):

Scalar:

$$ \begin{align*} D_\textrm{KL}(\theta||\theta') =& D_\textrm{KL}(\theta||\theta')|_{\theta' = \theta} + \sum_i \left.\frac{\partial D_\textrm{KL}(\theta||\theta')}{\partial \theta'_i}\right|_{\theta' = \theta} (\theta'_i - \theta_i) \\ &+ \frac{1}{2}\sum_i\sum_j \left.\frac{\partial^2 D_\textrm{KL}(\theta||\theta')}{\partial \theta'_i\partial \theta'_j}\right|_{\theta' = \theta} (\theta'_i - \theta_i) (\theta'_j - \theta_j) + ... \end{align*} $$

Vector:

$$ \begin{align*} D_\textrm{KL}(\theta||\theta') =& D_\textrm{KL}(\theta||\theta')|_{\theta' = \theta} + \left. \nabla_{\theta'}D_\textrm{KL}(\theta||\theta')^T\right|_{\theta' = \theta}(\theta' - \theta) \\ & + \frac{1}{2}\left. (\theta' - \theta)^T\nabla^2_{\theta'}D_\textrm{KL}(\theta||\theta')\right|_{\theta' = \theta}(\theta' - \theta) + ... \end{align*} $$

The constant term can be written as

$$ \begin{align*} D_\textrm{KL}(\theta||\theta')|_{\theta' = \theta} =& 0 \end{align*} $$

in both the scalar and the vector case.

The linear term in this decomposition, \(\frac{\partial D_\textrm{KL}(\theta||\theta')}{\partial \theta'_i}\) in the scalar case or \(\nabla_{\theta'}D_\textrm{KL}(\theta||\theta')\) in the vector case, can be written as

Scalar:

$$ \begin{align*} \frac{\partial D_\textrm{KL}(\theta||\theta')}{\partial \theta'_i} =& \frac{\partial}{\partial \theta'_i}\left(\int \limits_{-\infty}^{\infty} p(x|\theta) \log p(x|\theta) dx -\int \limits_{-\infty}^{\infty} p(x|\theta) \log p(x|\theta') dx\right) \\ =& -\int \limits_{-\infty}^{\infty} p(x|\theta) \frac{\partial}{\partial \theta'_i}\log p(x|\theta') dx\\ =& -\int \limits_{-\infty}^{\infty} p(x|\theta)\frac{1}{p(x|\theta')}\frac{\partial}{\partial \theta'_i} p(x|\theta') dx \\ =& - \mathbb{E}_{p(x|\theta)}\left[\frac{\partial}{\partial \theta'_i} \log p(x|\theta')\right]\\ =& - \mathbb{E}_{p(x|\theta)}\left[\frac{1}{p(x|\theta')}\frac{\partial}{\partial \theta'_i} p(x|\theta')\right]. \end{align*} $$

Vector:

$$ \begin{align*} \nabla_{\theta'}D_\textrm{KL}(\theta||\theta') =& \nabla_{\theta'}\left(\int \limits_{-\infty}^{\infty} p(x|\theta) \log p(x|\theta) dx -\int \limits_{-\infty}^{\infty} p(x|\theta) \log p(x|\theta') dx\right) \\ =& -\int \limits_{-\infty}^{\infty} p(x|\theta)\nabla_{\theta'} \log p(x|\theta') dx \\ =& -\int \limits_{-\infty}^{\infty} p(x|\theta)\frac{1}{p(x|\theta')}\nabla_{\theta'} p(x|\theta') dx \\ =& - \mathbb{E}_{p(x|\theta)}\left[\nabla_{\theta'} \log p(x|\theta')\right]\\ =& - \mathbb{E}_{p(x|\theta)}\left[\frac{1}{p(x|\theta')}\nabla_{\theta'} p(x|\theta')\right]. \end{align*} $$

Evaluated at \( \theta' = \theta \), we obtain

Scalar:

$$ \begin{align*} \left. \frac{\partial D_\textrm{KL}(\theta||\theta')}{\partial \theta'_i} \right|_{\theta' = \theta} =& - \mathbb{E}_{p(x|\theta)}\left[\frac{1}{p(x|\theta)}\frac{\partial}{\partial \theta_i} p(x|\theta)\right] \\ =& -\int \limits_{-\infty}^{\infty} p(x|\theta) \frac{1}{p(x|\theta)}\frac{\partial}{\partial \theta_i}p(x|\theta) dx \\ =& -\int \limits_{-\infty}^{\infty} \frac{\partial}{\partial \theta_i} p(x|\theta) dx \\ =& -\frac{\partial}{\partial \theta_i}\int \limits_{-\infty}^{\infty} p(x|\theta) dx \\ =& -\frac{\partial}{\partial \theta_i} 1 \\ =& 0 \end{align*} $$

Vector:

$$ \begin{align*} \left. \nabla_{\theta'}D_\textrm{KL}(\theta||\theta') \right|_{\theta' = \theta} =& - \mathbb{E}_{p(x|\theta)}\left[\frac{1}{p(x|\theta)}\nabla_{\theta} p(x|\theta)\right] \\ =& -\int \limits_{-\infty}^{\infty} p(x|\theta) \frac{1}{p(x|\theta)}\nabla_{\theta} p(x|\theta) dx \\ =& -\int \limits_{-\infty}^{\infty} \nabla_{\theta} p(x|\theta) dx \\ =& -\nabla_{\theta}\int \limits_{-\infty}^{\infty} p(x|\theta) dx \\ =& -\nabla_{\theta} 1 \\ =& 0 \end{align*} $$

The quadratic term in this decomposition, \(\frac{\partial^2 D_\textrm{KL}(\theta||\theta')}{\partial \theta'_i\partial \theta'_j}\) in the scalar case or \(\nabla^2_{\theta'}D_\textrm{KL}(\theta||\theta')\) in the vector case, can be written as

Scalar:

$$ \begin{align*} \frac{\partial^2 D_\textrm{KL}(\theta||\theta')}{\partial \theta'_i\partial \theta'_j} =& \frac{\partial }{\partial \theta'_j} \left( \frac{\partial}{\partial \theta'_i} D_\textrm{KL}(\theta||\theta')\right) \\ =& -\frac{\partial }{\partial \theta'_j} \left( \int \limits_{-\infty}^{\infty} p(x|\theta)\frac{\partial}{\partial \theta'_i} \log p(x|\theta') dx \right) \\ =& - \int \limits_{-\infty}^{\infty} p(x|\theta)\frac{\partial^2 \log p(x|\theta')}{\partial \theta'_i \partial \theta'_j} dx \\ =& -\mathbb{E}_{p(x|\theta)}\left[\frac{\partial^2 \log p(x|\theta')}{\partial \theta'_i \partial \theta'_j}\right] \end{align*}$$

Vector:

$$ \begin{align*} \nabla^2_{\theta'}D_\textrm{KL}(\theta||\theta') =& \nabla_{\theta'} \nabla_{\theta'}^T D_\textrm{KL}(\theta||\theta') \\ =& -\nabla_{\theta'} \left( \int \limits_{-\infty}^{\infty} p(x|\theta)\nabla_{\theta'}^T \log p(x|\theta') dx \right) \\ =& - \int \limits_{-\infty}^{\infty} p(x|\theta)\nabla_{\theta'}^2 \log p(x|\theta') dx \\ =& -\mathbb{E}_{p(x|\theta)}\left[\nabla_{\theta'}^2 \log p(x|\theta')\right]\\ \end{align*}$$

Evaluated at \( \theta' = \theta \), we obtain

Scalar:

$$ \begin{align*} \left. \frac{\partial^2 D_\textrm{KL}(\theta||\theta')}{\partial \theta'_i\partial \theta'_j} \right|_{\theta' = \theta} =& -\mathbb{E}_{p(x|\theta)}\left[\frac{\partial^2 \log p(x|\theta)}{\partial \theta_i \partial \theta_j}\right] \end{align*}$$

Vector:

$$ \begin{align*} \left. \nabla^2_{\theta'}D_\textrm{KL}(\theta||\theta') \right|_{\theta' = \theta} =& -\mathbb{E}_{p(x|\theta)}\left[\nabla_{\theta}^2 \log p(x|\theta)\right]\\ \end{align*}$$

Finally, you can express the Taylor approximation as

Scalar:

$$ \begin{align*} D_\textrm{KL}(\theta||\theta') =& -\frac{1}{2}\sum_i\sum_j \mathbb{E}_{p(x|\theta)}\left[\frac{\partial^2 \log p(x|\theta)}{\partial \theta_i \partial \theta_j} \right] (\theta'_i - \theta_i) (\theta'_j - \theta_j) + ... \end{align*} $$

Vector:

$$ \begin{align*} D_\textrm{KL}(\theta||\theta') =& -\frac{1}{2} (\theta' - \theta)^T\mathbb{E}_{p(x|\theta)}\left[\nabla_{\theta}^2 \log p(x|\theta)\right](\theta' - \theta) + ... \end{align*} $$
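A numerical sketch under the same Gaussian-mean assumption as above: for two Gaussians sharing \(\sigma\), the exact KL divergence coincides with the quadratic approximation, since \(-\mathbb{E}[\partial^2 \log p] = 1/\sigma^2\) and the KL divergence is exactly quadratic in the mean.

```python
# Sketch: exact KL between N(mu, sigma^2) and N(mu', sigma^2) versus the
# quadratic approximation 0.5 * (1/sigma^2) * (mu' - mu)^2.
import numpy as np

def kl_gauss_mean(mu, mu_prime, sigma):
    # closed form for D_KL(N(mu, sigma^2) || N(mu', sigma^2))
    return (mu - mu_prime)**2 / (2 * sigma**2)

mu, sigma = 0.0, 1.5
curvature = 1 / sigma**2   # -E[d^2/dmu^2 log p(x|mu)]
for mu_prime in [0.1, 0.5, 1.0]:
    exact = kl_gauss_mean(mu, mu_prime, sigma)
    approx = 0.5 * curvature * (mu_prime - mu)**2
    print(mu_prime, exact, approx)   # identical: KL is exactly quadratic here
```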

Cross entropy

The cross entropy is defined as

$$ H(\theta, \theta') = -\int \limits_{-\infty}^{\infty} p(x|\theta) \log p(x|\theta') dx. $$

The cross entropy and the Kullback-Leibler divergence differ only by the entropy \(H(\theta)\), which is constant in \(\theta'\). Therefore, their linear and quadratic terms are the same.
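A brief numerical illustration, assuming Gaussians with shared variance and using their closed-form entropy and KL divergence: the cross entropy is the entropy plus the KL divergence, so the \(\theta'\)-dependent parts agree.

```python
# Sketch: H(theta, theta') = H(theta) + D_KL(theta||theta') for two Gaussians
# with the same sigma (closed forms).
import numpy as np

mu, mu_prime, sigma = 0.0, 0.7, 1.0
entropy = 0.5 * np.log(2 * np.pi * np.e * sigma**2)     # H(theta)
kl = (mu - mu_prime)**2 / (2 * sigma**2)                # D_KL(theta||theta')
cross = entropy + kl                                    # closed-form H(theta, theta')
print(cross - entropy, kl)   # equal: the entropy term is constant in theta'
```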

Fisher information

The Fisher information can be defined as …

… expectation of the squared score

Scalar:

$$ \left[\mathcal{I(\theta)}\right]_{ij} = \mathbb{E}_{p(x|\theta)}\left[\frac{\partial}{\partial \theta_j} \log p(x|\theta) \frac{\partial}{\partial \theta_i} \log p(x|\theta)\right] $$

Vector:

$$ \mathcal{I(\theta)} = \mathbb{E}_{p(x|\theta)}\left[\nabla_{\theta} \log p(x|\theta) \nabla_{\theta} \log p(x|\theta)^T\right] $$

… variance of the score

Scalar:

$$ \begin{align*} \left[\mathcal{I(\theta)}\right]_{ij} &= \text{Cov}\left(\frac{\partial}{\partial \theta_i} \log p(x|\theta), \frac{\partial}{\partial \theta_j} \log p(x|\theta)\right) \\ &= \mathbb{E}_{p(x|\theta)}\left[\frac{\partial}{\partial \theta_i} \log p(x|\theta)\frac{\partial}{\partial \theta_j} \log p(x|\theta)\right] - \mathbb{E}_{p(x|\theta)}\left[\frac{\partial}{\partial \theta_i} \log p(x|\theta)\right]\mathbb{E}_{p(x|\theta)}\left[\frac{\partial}{\partial \theta_j} \log p(x|\theta)\right] \\ \end{align*} $$

Vector:

$$ \begin{align*} \mathcal{I(\theta)} &= \text{Var}\left(\nabla_{\theta} \log p(x|\theta)\right) \\ &= \mathbb{E}_{p(x|\theta)}\left[\nabla_{\theta} \log p(x|\theta)\nabla_{\theta} \log p(x|\theta)^T\right] - \mathbb{E}_{p(x|\theta)}\left[\nabla_{\theta} \log p(x|\theta)\right]\mathbb{E}_{p(x|\theta)}\left[\nabla_{\theta} \log p(x|\theta)\right]^T \\ \end{align*} $$

With

Scalar:

$$ \begin{align*} \mathbb{E}_{p(x|\theta)}\left[\frac{\partial}{\partial \theta_i} \log p(x|\theta)\right] =& \mathbb{E}_{p(x|\theta)}\left[\frac{1}{p(x|\theta)}\frac{\partial}{\partial \theta_i} p(x|\theta)\right] \\ =& \int \limits_{-\infty}^{\infty} p(x|\theta) \frac{1}{p(x|\theta)}\frac{\partial}{\partial \theta_i}p(x|\theta) dx \\ =& \int \limits_{-\infty}^{\infty} \frac{\partial}{\partial \theta_i} p(x|\theta) dx \\ =& \frac{\partial}{\partial \theta_i}\int \limits_{-\infty}^{\infty} p(x|\theta) dx \\ =& \frac{\partial}{\partial \theta_i} 1 \\ =& 0 \end{align*}$$

Vector:

$$ \begin{align*} \mathbb{E}_{p(x|\theta)}\left[\nabla_{\theta} \log p(x|\theta)\right] =& \mathbb{E}_{p(x|\theta)}\left[\frac{1}{p(x|\theta)}\nabla_{\theta} p(x|\theta)\right] \\ =& \int \limits_{-\infty}^{\infty} p(x|\theta) \frac{1}{p(x|\theta)}\nabla_{\theta} p(x|\theta) dx \\ =& \int \limits_{-\infty}^{\infty} \nabla_{\theta} p(x|\theta) dx \\ =& \nabla_{\theta}\int \limits_{-\infty}^{\infty} p(x|\theta) dx \\ =& \nabla_{\theta} 1 \\ =& 0 \end{align*}$$

follows

Scalar:

$$ \left[\mathcal{I(\theta)}\right]_{ij} = \mathbb{E}_{p(x|\theta)}\left[\frac{\partial}{\partial \theta_j} \log p(x|\theta) \frac{\partial}{\partial \theta_i} \log p(x|\theta)\right] $$

Vector:

$$ \mathcal{I(\theta)} = \mathbb{E}_{p(x|\theta)}\left[\nabla_{\theta} \log p(x|\theta) \nabla_{\theta} \log p(x|\theta)^T\right] $$
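A Monte Carlo sketch with an exponential distribution (an assumed example with simple closed forms): since the mean of the score vanishes, its variance and the expectation of its square agree, both approximating \(1/\lambda^2\).

```python
# Sketch: for p(x|lam) = lam * exp(-lam * x), the score is 1/lam - x and the
# Fisher information is 1/lam^2; check Var(score) = E[score^2] numerically.
import numpy as np

rng = np.random.default_rng(1)
lam, n = 2.0, 1_000_000
x = rng.exponential(1 / lam, size=n)

score = 1 / lam - x                   # d/dlam log p(x|lam)
print(score.mean())                   # ~ 0, so Var and E[score^2] coincide
print((score**2).mean(), 1 / lam**2)  # both ~ 1/lam^2
```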

… curvature of the KL-divergence

Scalar:

$$ \begin{align*} \left[\mathcal{I(\theta)}\right]_{ij} =&\left. \frac{\partial^2 D_\textrm{KL}(\theta||\theta')}{\partial \theta'_i\partial \theta'_j} \right|_{\theta' = \theta}\\ =& -\mathbb{E}_{p(x|\theta)}\left[\frac{\partial^2 \log p(x|\theta)}{\partial \theta_i \partial \theta_j}\right] \\ =& - \mathbb{E}_{p(x|\theta)}\left[-\frac{\partial}{\partial \theta_j} \log p(x|\theta) \frac{\partial}{\partial \theta_i} \log p(x|\theta) + \frac{1}{p(x|\theta)} \frac{\partial^2 p(x|\theta)}{\partial \theta_i\partial \theta_j}\right] \\ =& \mathbb{E}_{p(x|\theta)}\left[\frac{\partial}{\partial \theta_j} \log p(x|\theta) \frac{\partial}{\partial \theta_i} \log p(x|\theta)\right] - \mathbb{E}_{p(x|\theta)}\left[\frac{1}{p(x|\theta)} \frac{\partial^2 p(x|\theta)}{\partial \theta_i\partial \theta_j}\right]\\ \end{align*}$$

Vector:

$$ \begin{align*} \mathcal{I(\theta)} =&\left. \nabla^2_{\theta'}D_\textrm{KL}(\theta||\theta') \right|_{\theta' = \theta}\\ =& -\mathbb{E}_{p(x|\theta)}\left[\nabla_{\theta}^2 \log p(x|\theta)\right]\\ =& - \mathbb{E}_{p(x|\theta)}\left[-\nabla_{\theta} \log p(x|\theta) \nabla_{\theta} \log p(x|\theta)^T + \frac{1}{p(x|\theta)} \nabla^2_{\theta} p(x|\theta)\right] \\ =& \mathbb{E}_{p(x|\theta)}\left[\nabla_{\theta} \log p(x|\theta) \nabla_{\theta} \log p(x|\theta)^T\right] - \mathbb{E}_{p(x|\theta)}\left[\frac{1}{p(x|\theta)} \nabla^2_{\theta} p(x|\theta)\right]\\ \end{align*}$$

With

Scalar:

$$ \begin{align*} \mathbb{E}_{p(x|\theta)}\left[\frac{1}{p(x|\theta)} \frac{\partial^2 p(x|\theta)}{\partial \theta_i\partial \theta_j}\right]=& \int \limits_{-\infty}^{\infty} p(x|\theta) \frac{1}{p(x|\theta)} \frac{\partial^2 p(x|\theta)}{\partial \theta_i\partial \theta_j} dx \\ =& \int \limits_{-\infty}^{\infty} \frac{\partial^2 p(x|\theta)}{\partial \theta_i\partial \theta_j} dx \\ =& \frac{\partial^2 }{\partial \theta_i\partial \theta_j}\int \limits_{-\infty}^{\infty} p(x|\theta) dx \\ =& \frac{\partial^2 }{\partial \theta_i\partial \theta_j}1 \\ =& 0 . \end{align*} $$

Vector:

$$ \begin{align*} \mathbb{E}_{p(x|\theta)}\left[\frac{1}{p(x|\theta)} \nabla^2_{\theta} p(x|\theta)\right]=& \int \limits_{-\infty}^{\infty} p(x|\theta) \frac{1}{p(x|\theta)} \nabla^2_{\theta} p(x|\theta) dx \\ =& \int \limits_{-\infty}^{\infty} \nabla^2_{\theta} p(x|\theta) dx \\ =& \nabla^2_{\theta}\int \limits_{-\infty}^{\infty} p(x|\theta) dx \\ =& \nabla^2_{\theta}1 \\ =&0 \end{align*}$$

follows

Scalar:

$$ \left[\mathcal{I(\theta)}\right]_{ij} = \mathbb{E}_{p(x|\theta)}\left[\frac{\partial}{\partial \theta_j} \log p(x|\theta) \frac{\partial}{\partial \theta_i} \log p(x|\theta)\right] $$

Vector:

$$ \mathcal{I(\theta)} = \mathbb{E}_{p(x|\theta)}\left[\nabla_{\theta} \log p(x|\theta) \nabla_{\theta} \log p(x|\theta)^T\right] $$
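A finite-difference sketch of this identity for the Gaussian-mean example used earlier: the numerical curvature of \(D_\textrm{KL}\) at \(\theta' = \theta\) matches \(1/\sigma^2\).

```python
# Sketch: second derivative of D_KL(theta||theta') in theta' at theta' = theta,
# estimated by central finite differences, compared against I(theta) = 1/sigma^2.
import numpy as np

def kl(mu, mu_prime, sigma):
    # D_KL between two Gaussians sharing sigma
    return (mu - mu_prime)**2 / (2 * sigma**2)

mu, sigma, h = 0.0, 1.5, 1e-4
curvature = (kl(mu, mu + h, sigma) - 2 * kl(mu, mu, sigma)
             + kl(mu, mu - h, sigma)) / h**2
print(curvature, 1 / sigma**2)   # both ~ 0.444...
```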

… curvature of the cross entropy

Cross entropy and Kullback-Leibler divergence differ only by a constant additive term. Therefore, their curvatures are identical.

… expected negative curvature of log-likelihood

Scalar:

$$ \begin{align*} \left[\mathcal{I(\theta)}\right]_{ij} =& -\mathbb{E}_{p(x|\theta)}\left[\frac{\partial^2 \log p(x|\theta)}{\partial \theta_i \partial \theta_j}\right] \\ \end{align*}$$

Vector:

$$ \begin{align*} \mathcal{I(\theta)} =& -\mathbb{E}_{p(x|\theta)}\left[\nabla_{\theta}^2 \log p(x|\theta)\right]\\ \end{align*}$$

See curvature of the KL-divergence.

… expected value of the observed information

Scalar:

$$ \left[\mathcal{I(\theta)}\right]_{ij} = \mathbb{E}_{p(x|\theta)}\left[\left[\mathcal{J(\theta)}\right]_{ij}\right] $$

Vector:

$$ \mathcal{I(\theta)} = \mathbb{E}_{p(x|\theta)}\left[\mathcal{J(\theta)}\right] $$

where the observed information \(\mathcal{J(\theta)}\) is defined as

Scalar:

$$ \left[\mathcal{J(\theta)}\right]_{ij} = -\frac{\partial^2 \log p(x|\theta)}{\partial \theta_i \partial \theta_j} $$

Vector:

$$ \mathcal{J(\theta)} = -\nabla_{\theta}^2 \log p(x|\theta) $$

Note that this has the same form as the curvature of the KL-divergence.
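A Monte Carlo sketch with a Bernoulli model (assumed here because its observed information actually depends on \(x\)): the sample average of \(\mathcal{J}(\theta)\) approaches \(\mathcal{I}(\theta) = 1/(\theta(1-\theta))\).

```python
# Sketch: for Bernoulli(theta), log p = x*log(theta) + (1-x)*log(1-theta),
# so J(theta) = x/theta^2 + (1-x)/(1-theta)^2; its expectation is I(theta).
import numpy as np

rng = np.random.default_rng(2)
theta, n = 0.3, 1_000_000
x = rng.binomial(1, theta, size=n)

J = x / theta**2 + (1 - x) / (1 - theta)**2   # observed information per sample
print(J.mean(), 1 / (theta * (1 - theta)))    # both ~ 4.76
```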