This article is the second part of the nonlinear filtering series, where we derive the extended Kalman filter with non-additive and additive noise directly from the recursive equations of the Bayes filter. If you haven’t read the introduction, I would recommend reading it first. Before we dive into the derivation, let’s state the main idea behind the extended Kalman filter.

The extended Kalman filter approximates the Bayes filter by linearizing the system and observation equations.

Derivation

We will start the derivation directly from the recursive equations of the Bayes filter with the prediction step
\[ p(x_{t+1}|y_{1:t}) = \int p(x_{t+1}|x_t, u_t)\, p(x_t|y_{1:t})\, dx_t \]
and the update step
\[ p(x_t|y_{1:t}) = \frac{p(y_t|x_t)\, p(x_t|y_{1:t-1})}{p(y_t|y_{1:t-1})} \]
The extended Kalman filter is normally formulated with nonlinear functions with additive noise. In this article, we directly derive the general case for non-additive noise and obtain the extended Kalman filter as a special case. Therefore, our equations of the system are
\[ \begin{aligned} x_{t+1} &= f(x_t, u_t, w_t) \\ y_t &= h(x_t, v_t) \end{aligned} \]
with Gaussian process noise \( w_t \sim \mathcal{N}(w_t|0, Q_t) \) and Gaussian observation noise \( v_t \sim \mathcal{N}(v_t|0, R_t) \).

In our formula of the Bayes filter, we can’t find any functions \(f(x_t, u_t, w_t)\) or \(h(x_t, v_t)\). Therefore, our first step will be to express these functions as probability distributions \(p(x_{t+1}|x_{t}, u_{t})\) and \(p(y_t|x_t)\). The next box will show how to achieve this in general. Warning: Distributions are very weird and the following treatment is not rigorous.

We want to calculate \(p(y|x)\) given the function \(y=f(x,z)\) and \(p(z)\). We can express the deterministic function \(y=f(x,z)\) as probability distribution
\[ p(y|x, z) = \delta\left(y - f(x, z)\right) \]
with the Dirac delta function \(\delta(x)\).

In order to calculate \(p(y|x)\) we can simply marginalize out \(z\):
\[ p(y|x) = \int p(y|x,z)\, p(z)\, dz = \int \delta\left(y - f(x,z)\right) p(z)\, dz \]
By using the composition rule of the Dirac delta function, we can express this as
\[ p(y|x) = \sum_i \frac{p(z_i)}{\left|\det \nabla_{z} f(x, z_i)\right|} \]
where the sum goes over all \(z_i\) which satisfy the equation \(y = f(x,z)\).
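As a quick numerical sanity check of this composition rule, the sketch below uses a made-up scalar example \(y = x + z^2\) with \(z \sim \mathcal{N}(0,1)\) (not from the article): for \(y > x\) there are two roots \(z_i = \pm\sqrt{y-x}\) with \(|df/dz| = 2|z_i|\), and the resulting density can be compared against a Monte Carlo estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Composition rule for y = f(x, z) = x + z**2 with z ~ N(0, 1):
# for y > x the roots are z_i = +/- sqrt(y - x) and |df/dz| = 2|z_i|, so
#   p(y|x) = sum_i N(z_i|0,1) / |2 z_i| = N(sqrt(y-x)|0,1) / sqrt(y-x).
def p_y_given_x(y, x):
    root = np.sqrt(y - x)
    pdf = np.exp(-0.5 * root**2) / np.sqrt(2 * np.pi)
    return 2 * pdf / (2 * root)

# Monte Carlo check: empirical density of y = x + z**2 in a small bin.
x = 1.0
z = rng.standard_normal(2_000_000)
y_samples = x + z**2
y0, half = 1.5, 0.01
mc_density = np.mean(np.abs(y_samples - y0) < half) / (2 * half)
print(mc_density, p_y_given_x(y0, x))  # the two values should roughly agree
```

The two printed values agree up to Monte Carlo error, which is the claim of the composition rule for this example.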

In our case, we can express our system function \(x_{t+1} = f(x_t,u_t,w_t)\) as a probability distribution
\[ p(x_{t+1}|x_t, u_t) = \sum_i \frac{\mathcal{N}(w_i|0, Q_t)}{\left|\det \nabla_{w_t} f(x_t, u_t, w_i)\right|} \]
where the sum goes over all \(w_i\) which satisfy \(x_{t+1} = f(x_t,u_t,w_t)\).

Similarly, we can express our observation function \(y_t = h(x_t, v_t)\) as a probability distribution

\[ p(y_t|x_t) = \sum_i \frac{\mathcal{N}(v_i|0, R_t)}{\left|\det \nabla_{v_t} h(x_t, v_i)\right|} \]

where the sum goes over all \(v_i\) which satisfy \(y_t = h(x_t,v_i)\).

Even if we assume Gaussian process and measurement noise, the resulting distributions of our model will, in general, be non-Gaussian. This is where the extended Kalman filter comes into play: it approximates our probability distributions by linearization. Taken literally, linearizing a probability distribution makes no sense. So what is linearized instead?

Instead of linearizing our probability distributions, we linearize our deterministic functions before we express them as probability distributions.

In order to linearize, we are performing a first-order Taylor expansion
\[ f(x_t, u_t, w_t) \approx \hat{f}(x_t, u_t, w_t) = f(\hat x_{t|t}, u, 0) + \nabla_{x_t} f(x_t, u_t, w_t)^T (x_t - \hat x_{t|t}) + \nabla_{u_t} f(x_t, u_t, w_t)^T (u_t - u) + \nabla_{w_t} f(x_t, u_t, w_t)^T w_t \]
around the current state estimate \(x_t = \hat x_{t|t}\), current input \(u_t=u\) and zero process noise \(w_t=0\).

To unclutter the notation, we define \(A_t = \nabla_{x_t} f(x_t, u_t, w_t)^T|_ {x_t=\hat x_{t|t},u_t=u,w_t=0}\) , \(B_t = \nabla_{u_t} f(x_t, u_t, w_t)^T|_ {x_t=\hat x_{t|t},u_t=u,w_t=0}\) and \(L_t = \nabla_{w_t} f(x_t, u_t, w_t)^T|_ {x_t=\hat x_{t|t},u_t=u,w_t=0}\) to finally obtain the much cleaner representation
\[ \hat{f}(x_t, u_t, w_t) = f(\hat x_{t|t}, u, 0) + A_t (x_t - \hat x_{t|t}) + B_t (u_t - u) + L_t w_t \]
Now we are ready to transform our linearized deterministic system function into a probability distribution. We will use the same formula as above, but replace the true function \(f(x_t, u_t, w_t)\) with \(\hat{f}(x_t, u_t, w_t)\), and arrive at
\[ p(x_{t+1}|x_t, u_t) \approx \sum_i \frac{\mathcal{N}(w_i|0, Q_t)}{\left|\det \nabla_{w_t} \hat{f}(x_t, u_t, w_i)\right|} \]
where the sum is over all \(w_i\) which satisfy \(x_{t+1} = \hat{f}(x_t, u_t, w_t)\). Let’s try to simplify this expression! First, we notice that if the matrix \(L_t\) is invertible, then there is exactly one \(w\) that satisfies \(x_{t+1} = \hat{f}(x_t, u_t, w_t)\). We can express it as
\[ w = L_t^{-1}\left(x_{t+1} - f(\hat x_{t|t}, u, 0) - A_t(x_t - \hat x_{t|t}) - B_t(u_t - u)\right) \]
Note that if the matrix \(L_t\) is not invertible, we would have to integrate over its null space and the sum would become an integral.

Next, let’s look at the denominator. We can express the derivative of \(\hat{f}(x_t, u_t, w_t)\) with respect to \(w_t\) as
\[ \nabla_{w_t} \hat{f}(x_t, u_t, w_t)^T = L_t \]
Let’s plug our new information about \(w\) and the derivative into the expression:
\[ p(x_{t+1}|x_t, u_t) \approx \frac{\mathcal{N}\!\left(L_t^{-1}\left(x_{t+1} - f(\hat x_{t|t}, u, 0) - A_t(x_t - \hat x_{t|t}) - B_t(u_t - u)\right)\middle|\,0,\ Q_t\right)}{\left|\det L_t\right|} \]
We apply the transformation identity (Formula 35, Toussaint):
\[ p(x_{t+1}|x_t, u_t) \approx \mathcal{N}\!\left(x_{t+1} - f(\hat x_{t|t}, u, 0) - A_t(x_t - \hat x_{t|t}) - B_t(u_t - u)\,\middle|\,0,\ L_t Q_t L_t^T\right) \]
And apply it once more, to obtain our final result:
\[ p(x_{t+1}|x_t, u_t) \approx \mathcal{N}\!\left(x_{t+1}\,\middle|\,f(\hat x_{t|t}, u, 0) + A_t(x_t - \hat x_{t|t}) + B_t(u_t - u),\ L_t Q_t L_t^T\right) \]
We are done! The linearization of our function has led us back to Gaussianity!
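If the analytic Jacobians \(A_t\), \(B_t\) and \(L_t\) are tedious to derive by hand, they can also be approximated numerically. Below is a minimal sketch with a made-up system function (not from the article), using central finite differences:

```python
import numpy as np

# Made-up 2-D system: position/velocity with the process noise entering
# through a nonlinearity, so that L_t is not trivially the identity.
def f(x, u, w, dt=0.1):
    px, v = x
    return np.array([px + dt * v + w[0],
                     v + dt * u[0] + np.sin(w[1])])

def jacobian(fun, arg0, eps=1e-6):
    """Jacobian of fun w.r.t. its vector argument, by central differences."""
    n = len(arg0)
    y0 = fun(arg0)
    J = np.zeros((len(y0), n))
    for i in range(n):
        d = np.zeros(n)
        d[i] = eps
        J[:, i] = (fun(arg0 + d) - fun(arg0 - d)) / (2 * eps)
    return J

# Evaluate all three Jacobians at the linearization point
# (current estimate, current input, zero noise).
x_hat = np.array([1.0, 2.0])
u = np.array([0.5])
w0 = np.zeros(2)

A = jacobian(lambda x: f(x, u, w0), x_hat)    # d f / d x
B = jacobian(lambda u_: f(x_hat, u_, w0), u)  # d f / d u
L = jacobian(lambda w: f(x_hat, u, w), w0)    # d f / d w
print(A, B, L)
```

For this toy model the finite-difference result matches the analytic Jacobians: \(A = \begin{pmatrix}1 & 0.1\\ 0 & 1\end{pmatrix}\), \(B = \begin{pmatrix}0\\ 0.1\end{pmatrix}\), and \(L\) is the identity because \(\cos(0) = 1\).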

With the same strategy, we obtain for our observation model
\[ p(y_t|x_t) \approx \mathcal{N}\!\left(y_t\,\middle|\,h(\hat x_{t|t-1}, 0) + C_t(x_t - \hat x_{t|t-1}),\ M_t R_t M_t^T\right) \]
with \(C_t = \nabla_{x_t} h(x_t, v_t)^T|_ {x_t=\hat x_{t|t-1},v_t=0}\) and \(M_t = \nabla_{v_t} h(x_t, v_t)^T|_ {x_t=\hat x_{t|t-1},v_t=0}\).

Our linearization led to a Gaussian system and observation model. Therefore, the distribution of the updated state estimate
\[ p(x_t|y_{1:t}) = \mathcal{N}\!\left(x_t\,\middle|\,\hat x_{t|t},\ P_{t|t}\right) \]
and the distribution of the predicted state estimate
\[ p(x_{t+1}|y_{1:t}) = \mathcal{N}\!\left(x_{t+1}\,\middle|\,\hat x_{t+1|t},\ P_{t+1|t}\right) \]
will be Gaussians as well.

Now we are ready to plug our surrogate into the equations of the Bayes filter:

Prediction step
\[ p(x_{t+1}|y_{1:t}) = \int \mathcal{N}\!\left(x_{t+1}\,\middle|\,f(\hat x_{t|t}, u, 0) + A_t(x_t - \hat x_{t|t}) + B_t(u_t - u),\ L_t Q_t L_t^T\right) \mathcal{N}\!\left(x_t\,\middle|\,\hat x_{t|t},\ P_{t|t}\right) dx_t \]
Update step
\[ p(x_t|y_{1:t}) = \frac{\mathcal{N}\!\left(y_t\,\middle|\,h(\hat x_{t|t-1}, 0) + C_t(x_t - \hat x_{t|t-1}),\ M_t R_t M_t^T\right)\mathcal{N}\!\left(x_t\,\middle|\,\hat x_{t|t-1},\ P_{t|t-1}\right)}{p(y_t|y_{1:t-1})} \]
Let’s try to simplify these equations!

Prediction step

We will start with the prediction step
\[ p(x_{t+1}|y_{1:t}) = \int \mathcal{N}\!\left(x_{t+1}\,\middle|\,f(\hat x_{t|t}, u, 0) + A_t(x_t - \hat x_{t|t}) + B_t(u_t - u),\ L_t Q_t L_t^T\right) \mathcal{N}\!\left(x_t\,\middle|\,\hat x_{t|t},\ P_{t|t}\right) dx_t \]
To find an expression for our prediction step we can simply use the propagation formula (Formula 37, Toussaint)
\[ \int \mathcal{N}\!\left(x\,\middle|\,a + Fz,\ A\right) \mathcal{N}\!\left(z\,\middle|\,b,\ B\right) dz = \mathcal{N}\!\left(x\,\middle|\,a + Fb,\ A + FBF^T\right) \]
By comparison with our expression, we see that
\[ \begin{aligned} \hat x_{t+1|t} &= f(\hat x_{t|t}, u_t, 0) \\ P_{t+1|t} &= A_t P_{t|t} A_t^T + L_t Q_t L_t^T \end{aligned} \]
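The prediction step translates almost line by line into code. Below is a minimal sketch (function and variable names are mine, and the toy system used for the check is made up): the mean is propagated through the nonlinear \(f\), the covariance through the Jacobians.

```python
import numpy as np

def ekf_predict(x_hat, P, u, f, A_t, L_t, Q):
    """Non-additive EKF prediction: mean through f, covariance through Jacobians."""
    x_pred = f(x_hat, u, np.zeros(Q.shape[0]))    # f(x_hat, u_t, 0)
    P_pred = A_t @ P @ A_t.T + L_t @ Q @ L_t.T    # A P A^T + L Q L^T
    return x_pred, P_pred

# Sanity check on a toy *linear* system, where the EKF prediction must
# coincide with the ordinary Kalman filter prediction.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
f = lambda x, u, w: A @ x + w                      # linear f, additive noise
x_pred, P_pred = ekf_predict(np.array([1.0, 2.0]), np.eye(2), None,
                             f, A, np.eye(2), 0.1 * np.eye(2))
print(x_pred)  # [1.2, 2.0]
```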
Update step

We will start to simplify the update step
\[ p(x_t|y_{1:t}) = \frac{\mathcal{N}\!\left(y_t\,\middle|\,h(\hat x_{t|t-1}, 0) + C_t(x_t - \hat x_{t|t-1}),\ M_t R_t M_t^T\right)\mathcal{N}\!\left(x_t\,\middle|\,\hat x_{t|t-1},\ P_{t|t-1}\right)}{p(y_t|y_{1:t-1})} \]
by focusing on the numerator first. We notice that we can rewrite it as a joint distribution (Formula 39, Toussaint)
\[ \mathcal{N}\!\left(x\,\middle|\,a,\ A\right)\mathcal{N}\!\left(y\,\middle|\,b + Fx,\ B\right) = \mathcal{N}\!\left(\begin{pmatrix} x \\ y \end{pmatrix}\middle|\begin{pmatrix} a \\ b + Fa \end{pmatrix},\ \begin{pmatrix} A & AF^T \\ FA & B + FAF^T \end{pmatrix}\right) \]
Then again, this joint distribution can be rewritten as
\[ \mathcal{N}\!\left(\begin{pmatrix} x \\ y \end{pmatrix}\middle|\begin{pmatrix} a \\ b + Fa \end{pmatrix},\ \begin{pmatrix} A & AF^T \\ FA & B + FAF^T \end{pmatrix}\right) = \mathcal{N}\!\left(x\,\middle|\,a + AF^T\!\left(B + FAF^T\right)^{-1}\!(y - b - Fa),\ A - AF^T\!\left(B + FAF^T\right)^{-1}\!FA\right)\mathcal{N}\!\left(y\,\middle|\,b + Fa,\ B + FAF^T\right) \]
We can combine the two previous equations to the following expression
\[ \mathcal{N}\!\left(x\,\middle|\,a,\ A\right)\mathcal{N}\!\left(y\,\middle|\,b + Fx,\ B\right) = \mathcal{N}\!\left(x\,\middle|\,a + AF^T\!\left(B + FAF^T\right)^{-1}\!(y - b - Fa),\ A - AF^T\!\left(B + FAF^T\right)^{-1}\!FA\right)\mathcal{N}\!\left(y\,\middle|\,b + Fa,\ B + FAF^T\right) \]
By comparison with the numerator of our update step, we obtain
\[ p(x_t|y_{1:t}) = \frac{\mathcal{N}\!\left(x_t\,\middle|\,\hat x_{t|t-1} + P_{t|t-1}C_t^T\!\left(M_t R_t M_t^T + C_t P_{t|t-1} C_t^T\right)^{-1}\!\left(y_t - h(\hat x_{t|t-1}, 0)\right),\ P_{t|t-1} - P_{t|t-1}C_t^T\!\left(M_t R_t M_t^T + C_t P_{t|t-1} C_t^T\right)^{-1}\!C_t P_{t|t-1}\right)\mathcal{N}\!\left(y_t\,\middle|\,h(\hat x_{t|t-1}, 0),\ M_t R_t M_t^T + C_t P_{t|t-1} C_t^T\right)}{p(y_t|y_{1:t-1})} \]
which simplifies to
\[ p(x_t|y_{1:t}) = \frac{\mathcal{N}\!\left(x_t\,\middle|\,\hat x_{t|t-1} + K_t\left(y_t - h(\hat x_{t|t-1}, 0)\right),\ \left(I - K_t C_t\right)P_{t|t-1}\right)\mathcal{N}\!\left(y_t\,\middle|\,h(\hat x_{t|t-1}, 0),\ M_t R_t M_t^T + C_t P_{t|t-1} C_t^T\right)}{p(y_t|y_{1:t-1})} \]

with the Kalman gain \(K_t = P_{t|t-1}C_t^T\left(M_t R_t M_t^T + C_t P_{t|t-1} C_t^T\right)^{-1}\).
We applied the same trick as in the derivation of the Kalman filter: Conceptually, we only transformed
\[ p(y_t|x_t)\, p(x_t|y_{1:t-1}) = p(x_t, y_t|y_{1:t-1}) = p(x_t|y_{1:t})\, p(y_t|y_{1:t-1}) \]
If we look closely at the final expression, we see that the second factor in the numerator should cancel against the denominator \(p(y_t|y_{1:t-1})\). Therefore, the result is simply the remaining part
\[ p(x_t|y_{1:t}) = \mathcal{N}\!\left(x_t\,\middle|\,\hat x_{t|t-1} + K_t\left(y_t - h(\hat x_{t|t-1}, 0)\right),\ \left(I - K_t C_t\right)P_{t|t-1}\right) \]
If our reasoning is correct, the denominator should be equal to \(\mathcal{N}(y_{t}|h(\hat x_{t|t-1}, 0), M_tR_tM_t^T + C_tP_{t|t-1}C_t^T)\), which was canceled out. The denominator can be simplified with the propagation formula (Formula 37, Toussaint)
\[ p(y_t|y_{1:t-1}) = \int \mathcal{N}\!\left(y_t\,\middle|\,h(\hat x_{t|t-1}, 0) + C_t(x_t - \hat x_{t|t-1}),\ M_t R_t M_t^T\right)\mathcal{N}\!\left(x_t\,\middle|\,\hat x_{t|t-1},\ P_{t|t-1}\right)dx_t = \mathcal{N}\!\left(y_t\,\middle|\,h(\hat x_{t|t-1}, 0),\ M_t R_t M_t^T + C_t P_{t|t-1} C_t^T\right) \]
Yay! We see that the denominator is exactly the same as the canceled factor in the numerator.
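The update step likewise maps directly to code. Here is a minimal sketch (again with my own naming and a made-up toy measurement for the check): innovation covariance, Kalman gain, and the Gaussian posterior parameters.

```python
import numpy as np

def ekf_update(x_pred, P_pred, y, h, C_t, M_t, R):
    """Non-additive EKF update: gain from the innovation covariance."""
    S = C_t @ P_pred @ C_t.T + M_t @ R @ M_t.T     # C P C^T + M R M^T
    K = P_pred @ C_t.T @ np.linalg.inv(S)          # Kalman gain
    x_upd = x_pred + K @ (y - h(x_pred, np.zeros(R.shape[0])))
    P_upd = (np.eye(len(x_pred)) - K @ C_t) @ P_pred
    return x_upd, P_upd

# Toy usage: directly observe the first state component with unit noise.
C = np.array([[1.0, 0.0]])
h = lambda x, v: C @ x + v
x_upd, P_upd = ekf_update(np.array([1.0, 2.0]), np.eye(2), np.array([1.5]),
                          h, C, np.eye(1), np.array([[1.0]]))
print(x_upd)  # gain of 0.5 on the first component -> [1.25, 2.0]
```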

Let’s summarize our results:

Extended Kalman filter with non-additive noise

The recursive formula for the extended Kalman filter with non-additive noise consists of the prediction step
\[ \begin{aligned} \hat x_{t+1|t} &= f(\hat x_{t|t}, u_t, 0) \\ P_{t+1|t} &= A_t P_{t|t} A_t^T + L_t Q_t L_t^T \end{aligned} \]
and the update step
\[ \begin{aligned} \hat x_{t|t} &= \hat x_{t|t-1} + K_t\left(y_t - h(\hat x_{t|t-1}, 0)\right) \\ P_{t|t} &= \left(I - K_t C_t\right)P_{t|t-1} \end{aligned} \]
with
\[ K_t = P_{t|t-1}C_t^T\left(C_t P_{t|t-1} C_t^T + M_t R_t M_t^T\right)^{-1} \]
That’s it! We derived the equations of the extended Kalman filter. To bring the equations into a more implementation-friendly form, we restate the extended Kalman filter as:

Extended Kalman filter with non-additive noise

The recursive formula for the extended Kalman filter with non-additive noise consists of the prediction step
\[ \begin{aligned} \hat x_{t|t-1} &= f(\hat x_{t-1|t-1}, u_{t-1}, 0) \\ P_{t|t-1} &= A_{t-1} P_{t-1|t-1} A_{t-1}^T + L_{t-1} Q_{t-1} L_{t-1}^T \end{aligned} \]
and the update step
\[ \begin{aligned} \hat x_{t|t} &= \hat x_{t|t-1} + K_t\left(y_t - h(\hat x_{t|t-1}, 0)\right) \\ P_{t|t} &= \left(I - K_t C_t\right)P_{t|t-1} \end{aligned} \]
with
\[ K_t = P_{t|t-1}C_t^T\left(C_t P_{t|t-1} C_t^T + M_t R_t M_t^T\right)^{-1} \]
As promised, we will also look at the special case of additive noise. Our functions then take the form
\[ \begin{aligned} x_{t+1} &= f(x_t, u_t) + w_t \\ y_t &= h(x_t) + v_t \end{aligned} \]
In this case, the matrices \(L_t\) and \(M_t\) are identity matrices:
\[ L_t = \nabla_{w_t}\left(f(x_t, u_t) + w_t\right)^T = I, \qquad M_t = \nabla_{v_t}\left(h(x_t) + v_t\right)^T = I \]
Finally, we can state the equations of the extended Kalman filter with additive noise.

Extended Kalman filter with additive noise

The recursive formula for the extended Kalman filter with additive noise consists of the prediction step
\[ \begin{aligned} \hat x_{t|t-1} &= f(\hat x_{t-1|t-1}, u_{t-1}) \\ P_{t|t-1} &= A_{t-1} P_{t-1|t-1} A_{t-1}^T + Q_{t-1} \end{aligned} \]
and the update step
\[ \begin{aligned} \hat x_{t|t} &= \hat x_{t|t-1} + K_t\left(y_t - h(\hat x_{t|t-1})\right) \\ P_{t|t} &= \left(I - K_t C_t\right)P_{t|t-1} \end{aligned} \]
with
\[ K_t = P_{t|t-1}C_t^T\left(C_t P_{t|t-1} C_t^T + R_t\right)^{-1} \]
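To see the additive-noise filter in action, here is a self-contained numerical sketch. The system (a 1-D car measuring its distance to an off-track landmark) and all numbers are made up for illustration, loosely inspired by the race track example below:

```python
import numpy as np

rng = np.random.default_rng(1)
dt, d = 0.1, 5.0           # time step; lateral offset of the landmark
Q = np.diag([1e-4, 1e-3])  # process noise covariance (additive, so L_t = I)
R = np.array([[0.01]])     # measurement noise covariance (M_t = I)

f = lambda x, u: np.array([x[0] + dt * x[1], x[1] + dt * u])  # pos/vel model
h = lambda x: np.array([np.hypot(x[0], d)])                   # distance to landmark

x_true = np.array([0.0, 1.0])
x_hat, P = np.array([0.2, 0.8]), np.eye(2)

for _ in range(200):
    u = 0.1
    # --- simulate the true system ---
    x_true = f(x_true, u) + rng.multivariate_normal(np.zeros(2), Q)
    y = h(x_true) + rng.multivariate_normal(np.zeros(1), R)
    # --- predict ---
    A = np.array([[1.0, dt], [0.0, 1.0]])   # Jacobian of f (exact here)
    x_hat, P = f(x_hat, u), A @ P @ A.T + Q
    # --- update ---
    C = np.array([[x_hat[0] / np.hypot(x_hat[0], d), 0.0]])  # Jacobian of h
    S = C @ P @ C.T + R
    K = P @ C.T @ np.linalg.inv(S)
    x_hat = x_hat + K @ (y - h(x_hat))
    P = (np.eye(2) - K @ C) @ P

print(x_true, x_hat)  # the estimate should track the true state
```

Note that near \(x_0 = 0\) the measurement Jacobian is almost zero, so the early updates contribute little; this is the same tangential blind spot discussed in the example below.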
Example

Enough of the dry theory! Let’s play around with the extended Kalman filter in our race track example.

Backward
No action
Forward
Predict step
Observe
Update step

On the outside of the race track, you will notice a blue-colored strip. This strip represents our current posterior over the position of the race car. We start with a Gaussian prior distribution around the true mean. By pressing the OBSERVE button, two things will happen: first, we take a measurement of the distance to the tree, and second, we display the likelihood for this observed distance on the brown strip inside the race track. By pressing the UPDATE STEP button, we perform our update step and show the resulting posterior on the outer strip. Now we are ready for the next time step. Take an action by pressing the corresponding button below the race track. After the step is performed, you have to update your posterior by pressing the PREDICT STEP button. You will see that the outer strip changes accordingly. With that, we have finished one full cycle of the filtering process and are ready to start a new cycle by taking a measurement.

What if our distance meter stops working? By either clicking on the tree or pressing the W key on your keyboard, you can turn off your measurement device. The observation and update steps will then be skipped. The tree becomes translucent if your measurement device is turned off.

If you want to reset the environment, just press the reset button in the bottom left corner or press the R key on your keyboard. As before, you can control the car using your keyboard: A (Backward), S (Stop), D (Forward) or the buttons below the race track.

We see that it is actually working pretty well. But one thing seems particularly weird: at certain positions on the race track, the brownish inner strip (our likelihood) seems to be uniformly distributed. This is not a bug, but a shortcoming of the linearization. We will always experience this behavior if the road at our current posterior mean points purely in the tangential direction, with the tree as the center. In other words: a small change in position wouldn’t change the distance. When we linearize, we assume this local behavior applies to the whole system, and we won’t get any new information out of our measurement. To get an intuitive understanding of this, imagine two parallel lines. Our car drives along one of the lines, and we take the nearest distance to the other line as a measurement. This measurement is uninformative because the distance between parallel lines is always the same.
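This tangential blind spot is easy to verify numerically. The sketch below (with a made-up car and tree position) computes the gradient of the distance measurement \(h(p) = \lVert p - \text{tree}\rVert\) and its directional derivative along the tangential motion direction:

```python
import numpy as np

# Gradient of h(p) = ||p - tree|| w.r.t. the position p is the radial unit
# vector pointing from the tree to the car.  If the car can only move
# tangentially (perpendicular to that vector), the directional derivative of
# the measurement along the motion is zero: the linearized measurement
# carries no information about the position along the road.
tree = np.array([0.0, 0.0])
car = np.array([3.0, 4.0])                # hypothetical position, 5 m from the tree

grad_h = (car - tree) / np.linalg.norm(car - tree)  # radial unit vector
tangent = np.array([-4.0, 3.0]) / 5.0               # unit vector along a circular road

print(grad_h @ tangent)  # 0.0: a small tangential step does not change the distance
```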

Are you curious about the derivation of the unscented Kalman filter? Then you should definitely check out the next article of the nonlinear filtering series, covering the unscented Kalman filter. See you there!

Acknowledgement

The vector graphics of the car were created by Freepik.