The concept and the equations of the Kalman filter can be quite confusing at the beginning. Often the assumptions are not stated clearly and the equations are just falling from the sky. This post is an attempt to derive the equations of the Kalman filter in a systematic and hopefully understandable way using Bayesian inference. It addresses everyone, who wants to get a deeper understanding of the Kalman filter and is equipped with basic knowledge of linear algebra and probability theory.

First of all, let’s try to formulate the main idea of Kalman filtering in one sentence:

The Kalman filter is used to infer the current state of a linear Gaussian state space model given all observations and inputs up to the current timestep and a Gaussian prior distribution of the initial state.

Indeed, the process of Kalman filtering is simply Bayesian inference in the domain of linear Gaussian state space models. The information encoded in our formulation is sufficient to uniquely define what the Kalman filter should output. But it doesn’t tell us anything about how to compute it. In this article, we will find an efficient recursive method that will lead us to the familiar equations.

Let’s get started with the derivation by defining linear Gaussian state space models and Bayesian inference.

Linear Gaussian state space models

A linear Gaussian state space model is defined by

$\begin{align}x_{t+1} &= A_tx_t + B_t u_t + w_t \\ y_t &= C_tx_t + v_t \end{align}$

with state $x_t$, output $y_t$, input $u_t$, system matrix $A_t$, input matrix $B_t$, output matrix $C_t$, Gaussian process noise $ w_t \sim \mathcal{N}(w_t|0, Q_t) $ and Gaussian observation noise $ v_t \sim \mathcal{N}(v_t|0, R_t) $.

Alternatively, we can use the probability density functions

$\begin{align}x_{t+1} &\sim \mathcal{N}(x_{t+1}|A_tx_t + B_t u_t, Q_t)\\ y_t &\sim \mathcal{N}(y_t|C_tx_t, R_t)\end{align}$

to describe the system in a more compact fashion.

Linear Gaussian state space models can also be described in the language of probabilistic graphical models (or more precisely in the language of Bayesian networks). The figure below shows such a model up to time $T = 4$.

Every node represents a random variable and the edges are representing conditional dependencies between the respective nodes. Random variables, that are observed (or given) are shaded in light blue. In our case this is the output $y_t$ and the input $u_t$. The state $x_t$ is not observed (or latent).

Bayesian inference

In simplest terms Bayesian inference tries to update a hypothesis/belief of something, that is not directly observable, in the face of new information by using Bayes’ rule. A bit more formal: The goal is to update the prior distribution $p(x)$ given new data $\mathcal{D}$ to obtain the posterior distribution $p(x|\mathcal{D})$ with help of Bayes rule

$p(x|\mathcal{D}) = \frac{p(\mathcal{D}|x)p(x)}{p(\mathcal{D})}$

with likelihood $p(\mathcal{D}|x)$ and evidence $p(\mathcal{D})$.

This idea is very general and can be applied to dynamical models quite easily. The most common inference tasks in dynamical models are filtering, smoothing and prediction. These methods differ only in the form of the posterior distribution.

Filtering: What is my belief about the current state $x_t$ given all observations and inputs?

$$ p(x_t|y_0,...,y_t,u_0,...,u_{t-1}) $$

Smoothing: What is my belief about all states $x_t, … ,x_0$ given all observations and inputs?

$$ p(x_t,...,x_0|y_0,...,y_t,u_0,...,u_{t-1}) $$

Prediction: What is my belief about the next state $x_{t+1}$ given all observations and inputs?

$$ p(x_{t+1}|y_0,...,y_t,u_0,...,u_{t-1}) $$

The name Kalman filter reveals, that we will be interested in the filtering problem. Therefore, we want to infer the current state $x_t$ based on all recent observations $y_0,…,y_t$ and inputs $u_0,…,u_{t-1}$. Now that we have defined what we are looking for, let’s try to find a way to efficiently calculate it. We will start by finding a recursive method for general dynamical models defined by the probabilistic graphical model above.

Bayes filter for state space models

We have the task to calculate $ p(x_{t}|y_0,…,y_t,u_0,…,u_{t-1}) $. For this purpose only the structure of the graphical model will matter: it governs the conditional dependencies. To unclutter the notation we will use $\Box_{n:m}$ for $\Box_n,…,\Box_m$.

With help of Bayes’ rule we can rewrite the formula as

$p(x_t|y_{0:t},u_{0:t-1}) = \frac{p(y_t|x_t,y_{0:t-1},u_{0:t-1})p(x_t|y_{0:t-1},u_{0:t-1})}{p(y_t|y_{0:t-1},u_{0:t-1})}.$

If you are not very familiar with Bayes’ rule this can be quite confusing. There are much more moving parts than in the very simple definition. Nonetheless, there is an intuitive explanation. It is Bayes’ rule applied in a world, where we already observed $\mathcal{W}$ in the past (every term is conditioned on $\mathcal{W}$):

$p(x|\mathcal{D},\mathcal{W}) = \frac{p(\mathcal{D}|x,\mathcal{W})p(x|\mathcal{W})}{p(\mathcal{D}|\mathcal{W})}.$

In our case $x:=x_t $, $\mathcal{D}:=y_t $ and $\mathcal{W}:=(y_{0:t-1},u_{0:t-1}) $.

We note that $y_t$ is independent of $y_{0:t-1}$ and $u_{0:t-1}$ given $x_t$. It follows

$p(x_t|y_{0:t},u_{0:t-1}) = \frac{p(y_t|x_t)p(x_t|y_{0:t-1},u_{0:t-1})}{p(y_t|y_{0:t-1},u_{0:t-1})}.$

This conditional independence property is not obvious as well. When it comes to conditional dependencies, it is always a good idea to look at the graphical model.

In the figure above we notice that the node $x_t$ is shaded (observed). This node blocks the way of $y_{0:t-1}$ and $u_{0:t-1}$ to $y_t$. We have proven the conditional independence visually. You can learn more about conditional independence in probabilistic graphical models in Pattern Recognition and Machine Learning (Chapter 8.2).

The denominator is simply the integral of the numerator

$p(y_t|y_{0:t-1},u_{0:t-1}) = \int_{x_t} p(y_t|x_t)p(x_t|y_{0:t-1},u_{0:t-1}) dx_t .$

Great! We successfully expressed our equation in simpler terms. In return, we obtained the new expression $p(x_t|y_{0:t-1},u_{0:t-1})$, which we have to calculate as well. Using marginalization we can express it as

$p(x_t|y_{0:t-1},u_{0:t-1}) = \int_{x_{t-1}} p(x_t,x_{t-1}|y_{0:t-1},u_{0:t-1}) dx_{t-1}.$

We can split the expression in the integral with product rule, which leads to

$p(x_t|y_{0:t-1},u_{0:t-1}) = \int_{x_{t-1}} p(x_t|x_{t-1},y_{0:t-1},u_{0:t-1})p(x_{t-1}|y_{0:t-1},u_{0:t-1}) dx_{t-1}.$

Note that $x_t$ is independent of $y_{0:t-1}$ and $u_{0:t-2}$ given $x_{t-1}$. Furthermore, $x_{t-1}$ is independent of $u_{t-1}$. We obtain

$p(x_t|y_{0:t-1},u_{0:t-1}) = \int_{x_{t-1}} p(x_t|x_{t-1}, u_{t-1})p(x_{t-1}|y_{0:t-1},u_{0:t-2}) dx_{t-1}.$

We note that $p(x_{t-1}|y_{0:t-1},u_{0:t-2})$ has the same form as our expression we started from only shifted by one time step. Our recursive formula is complete!

Let’s summarize our results!

Bayes filter for state space models

The recursive formula for the Bayes filter in state space models consists of the prediction step

$p(x_{t+1}|y_{0:t},u_{0:t}) = \int_{x_{t}} p(x_{t+1}|x_{t}, u_{t})p(x_{t}|y_{0:t},u_{0:t-1}) dx_{t}$

and the update step

$p(x_t|y_{0:t},u_{0:t-1}) = \frac{p(y_t|x_t)p(x_t|y_{0:t-1},u_{0:t-1})}{p(y_t|y_{0:t-1},u_{0:t-1})} .$

The recursion is started with the prior distribution over the initial state $p(x_0)$.

Up to this point, we assumed that we obtain exactly one observation at every timestep. This rather limiting assumption is violated in many real-life scenarios. Multiple or even no observations per timestep are possible. This behavior is exemplified in the probabilistic graphical model below.

Fortunately, handling these cases is very simple. For every observation we make, we calculate the update step with the newest estimate available. Furthermore, it is not necessary that the observations are coming from the same output function (illustrated by the outputs $y_2$ and $z_2$ at $t=2$). Information integration/fusion is very natural in Bayesian inference.

Nice! We just derived the equations of the Bayes filter for general state space models! Now let’s translate this into the linear state space scenario.

Bayes filter in linear Gaussian state space models

Let’s start by identifying the probability distributions we already know:

$\begin{align} p(x_{t+1}|x_{t}, u_{t}) &= \mathcal{N}(x_{t+1}|A_tx_t + B_t u_t, Q_t) \\ p(y_t|x_t) &= \mathcal{N}(y_t|C_tx_t, R_t). \end{align}$

Furthermore, we assume that the prior distribution of the initial state is Gaussian as well. All probability distributions in our model are Gaussian. Therefore, the distributions $p(x_t|y_{0:t-1},u_{0:t-1})$ and $p(x_t|y_{0:t},u_{0:t-1})$ will also be in form of Gaussian distributions, because our recursive formula is only using marginalization and Bayes’ rule, which are closed under Gaussian distributions. In the context of Kalman filtering, these are normally defined by

$\begin{align} p(x_t|y_{0:t},u_{0:t-1}) &:= \mathcal{N}(x_{t}|\hat x_{t|t}, P_{t|t}) \\ p(x_t|y_{0:t-1},u_{0:t-1}) &:= \mathcal{N}(x_{t}|\hat x_{t|t-1}, P_{t|t-1}) . \end{align}$

Please note, that these distributions are still implicitly dependent on the inputs and outputs. The mean and the covariance are a sufficient statistic of the in- and outputs.

The index $\Box_{n|m}$ of the parameters indicates that the state at time $n$ is estimated, based on the outputs upto time $m$. The expression $\hat x_{t|t}$ is called the updated state estimate and $ P_{t|t}$ the updated error covariance. Moreover, $\hat x_{t|t-1}$ is called the predicted state estimate and $ P_{t|t-1}$ the predicted error covariance.

In summary, these are the equations for the Bayes filter in linear Gaussian state space models:

Prediction step

$\mathcal{N}(x_{t+1}|\hat x_{t+1|t}, P_{t+1|t}) = \int_{x_t}\mathcal{N}(x_{t+1}|A_tx_t + B_t u_t, Q_t)\mathcal{N}(x_t|\hat x_{t|t}, P_{t|t}) dx_t.$

Update step

$\mathcal{N}(x_{t}|\hat x_{t|{t}}, P_{t|t} ) = \frac{\mathcal{N}(y_{t}|C_tx_{t}, R_t )\mathcal{N}(x_{t}|\hat x_{t|t-1}, P_{t|t-1})}{\int_{x_{t}}\mathcal{N}(y_{t}|C_tx_{t}, R_t )\mathcal{N}(x_{t}|\hat x_{t|t-1}, P_{t|t-1}) dx_{t}}$

Let’s try to simplify these equations!

Prediction step

We will start with the prediction step

$\mathcal{N}(x_{t+1}|\hat x_{t+1|t}, P_{t+1|t}) = \int_{x_t}\mathcal{N}(x_{t+1}|A_tx_t + B_t u_t, Q_t)\mathcal{N}(x_t|\hat x_{t|t}, P_{t|t}) dx_t.$

In order to find a closed form solution of this integral, we could simply plug in the corresponding expressions of the Gaussian distributions and solve the integral. Fortunately, Marc Toussaint already gathered the most important Gaussian identities, which will lighten our workload a lot. To find an expression for our prediction step we can simply use the propagation formula (Formula 37, Toussaint)

$\int_{y}\mathcal{N}(x|a + Fy, A)\mathcal{N}(y|b,B) dy = \mathcal{N}(x|a + Fb, A + FBF^T ).$

By comparison with our expression, we see that

$\begin{align} \hat x_{t+1|t} &= A_t \hat x_{t|t} + B_tu_t \\ P_{t+1|t} &= Q_t + A_t P_{t|t} A_t^T. \end{align}$

Update step

We will start to simplify the update step

by focussing on the numerator first. We notice that we can rewrite it as a joint distribution (Formula 39, Toussaint)

$\mathcal{N}(x|a,A)\mathcal{N}(y|b + Fx,B) = \mathcal{N}\left(\begin{matrix}x \\y\end{matrix}\middle|\begin{matrix}a\\b + Fa \end{matrix},\begin{matrix}A & A^TF^T\\FA & B + FA^TF^T\end{matrix}\right) .$

Then again, this joint distribution can be rewritten as

$\mathcal{N}\left(\begin{matrix}x \\y\end{matrix}\middle|\begin{matrix}d\\e \end{matrix},\begin{matrix}D & F\\F^T & E\end{matrix}\right) = \mathcal{N}(y|e,E)\mathcal{N}(x|d + F^TE^{-1}(y-e),D - F^T E^{-1}F) .$

We can combine the two previous equations to the following expression

$\mathcal{N}(x|a,A)\mathcal{N}(y|b + Fx,B) = \mathcal{N}(y|b + Fa,B + FA^TF^T) \mathcal{N}(x|a + A^TF^T(B + FA^TF^T)^{-1}(y-b -Fa),A - A^TF^T (B + FA^TF^T)^{-1}FA) .$

By comparison with the numerator of our update step, we obtain

$\mathcal{N}(x_{t}|\hat x_{t|t-1}, P_{t|t-1})\mathcal{N}(y_{t}|C_tx_{t}, R_t ) = \mathcal{N}(y_{t}|C_t\hat x_{t|t},R_t + C_tP_{t|t-1}C_t^T) \mathcal{N}(x_{t}|\hat x_{t|t-1} + P_{t|t-1}C_t^T(R_t + C_tP_{t|t-1}C_t^T)^{-1}(y_{t}-C_t\hat x_{t|t-1}), P_{t|t-1} - P_{t|t-1}C_t^T (R_t + C_tP_{t|t-1}C_t^T)^{-1}C_tP_{t|t-1}).$

At a first glance, this is not looking like a simplification at all. Conceptually, we only transformed

$\frac{p(y|x)p(x)}{p(y)} \to \frac{p(y,x)}{p(y)} \to \frac{p(x|y)p(y)}{p(y)}.$

If we look closely at the final expression, we see that $p(y)$ is canceling out. Therefore, the result is simply the remaining part

$\mathcal{N}(x_{t}|\hat x_{t|{t}}, P_{t|t} ) = \mathcal{N}(x_{t}|\hat x_{t|t-1} + P_{t|t-1}C_t^T(R_t + C_tP_{t|t-1}C_t^T)^{-1}(y_{t}-C_t\hat x_{t|t-1}),P_{t|t-1} - P_{t|t-1}C_t^T (R_t + C_tP_{t|t-1}C_t^T)^{-1}C_tP_{t|t-1}).$

If our reasoning is correct the denominator should be equal to $\mathcal{N}(y_{t}|C_t\hat x_{t|t},R_t + C_tP_{t|t-1}C_t^T)$, which was canceled out. The denominator can be simplified with the propagation formula (Formula 37, Toussaint)

$\int_{x_{t}}\mathcal{N}(y_{t}|C_tx_{t}, R_t )\mathcal{N}(x_{t}|\hat x_{t|t-1}, P_{t|t-1}) dx_{t} = \mathcal{N}({y_{t}}|C_t\hat x_{t|t-1}, R_t + C_tP_{t|t-1}C_t^T ).$

Yay! We see, that the denominator is exactly the same as the canceled factor in the numerator.

Let’s summarize our results:

Bayes filter in linear Gaussian state space models

The recursive formula for the Bayes filter in linear Gaussian state space models consists of the prediction step

$\begin{align}\hat x_{t+1|t} &= A_t \hat x_{t|t} + B_tu_t \\ P_{t+1|t} &= Q_t + A_t P_{t|t} A_t^T \end{align}$

and the update step

$\begin{align}\hat x_{t|t} &= \hat x_{t|t-1} + P_{t|t-1}C_t^T(R_t + C_tP_{t|t-1}C_t^T)^{-1}(y_{t}-C_t\hat x_{t|t-1}) \\ P_{t|t} &= P_{t|t-1} - P_{t|t-1}C_t^T (R_t + C_tP_{t|t-1}C_t^T)^{-1}C_tP_{t|t-1}. \end{align}$

That’s it! We derived the equations of the Bayes filter in linear Gaussian state space models, which is nothing else but the good old Kalman filter. In the next section, we will split these equations up to finally obtain the formulation normally used for the Kalman filter.

Kalman filter

In order to obtain the familiar equations of the Kalman filter we have to define

Innovation
$z_t = y_{t}-C_t\hat x_{t|t-1}$
Innovation covariance
$S_t = R_t + C_tP_{t|t-1}C_t^T$
Optimal Kalman gain
$K_t = P_{t|t-1}C_t^TS_t^{-1}$

What is the meaning of $z_t$ and $S_t$?

The denominator of the update step is

$\mathcal{N}(y_{t}|C_t\hat x_{t|t-1},R_t + C_tP_{t|t-1}^TC_t^T)$

and can be transformed by (Formula 34, Toussaint)

$\mathcal{N}(x|a,A) = \mathcal{N}(x+f|a+f,A)$

to obtain the expression

$\mathcal{N}(y_{t} - C_t\hat x_{t|t-1}|0,R_t + C_tP_{t|t-1}^TC_t^T) = \mathcal{N}(z_t|0,S_t).$

Therefore, the innovation $z_t$ is the deviation of the expected output and the observed output. The random variable $z_t$ has a Gaussian distribution with zero mean and variance $S_t$.

Let’s plug these definitions into the equations of our update step

$\hat x_{t|t} = \hat x_{t|t-1} + \underbrace{P_{t|t-1}C_t^T(\underbrace{R_t + C_tP_{t|t-1}C_t^T}_{S_t})^{-1}}_{K_t}(\underbrace{y_{t}-C_t\hat x_{t|t-1}}_{z_t})$ $P_{t|t} = P_{t|t-1} - \underbrace{P_{t|t-1}C_t^T(\underbrace{R_t + C_tP_{t|t-1}C_t^T}_{S_t})^{-1}}_{K_t}P_{t|t-1} .$

This leads us to the final equations of the Kalman filter.

Equations of the Kalman filter

The recursive formula for the Kalman filter consists of the prediction step

$\begin{align}\hat x_{t+1|t} &= A_t \hat x_{t|t} + B_tu_t \\ P_{t+1|t} &= Q_t + A_t P_{t|t} A_t^T \end{align}$

and the update step

$\begin{align} z_t &= y_{t}-C_t\hat x_{t|t-1}\\ S_t &= R_t + C_tP_{t|t-1}C_t^T\\ K_t &= P_{t|t-1}C_t^TS_t^{-1} \\ \hat x_{t|t} &= \hat x_{t|t-1} + K_t z_t\\ P_{t|t} &= (I - K_tC_t)P_{t|t-1}. \end{align}$

Summary

This article presented the derivation of the Kalman filter from first principles using Bayesian inference. The goal was to derive the Kalman filter in a clear and straightforward fashion. The steps were designed to be as atomic as possible, in order to be comprehensible for readers, who are not so familiar with the tools we used. Summarized, the derivation was performed in the following four subsequent steps:

We realized, that we have to calculate $ p(x_{t}|y_0,…,y_t,u_0,…,u_{t-1}) $.
Derived the recursive equations of the Bayes filter to efficiently calculate this distribution.
Inserted the corresponding distributions of the linear Gaussian state space model.
Added some “sugar” to obtain the usual equations of the Kalman filter.