The unscented Kalman filter is another method for approximating non-linear Bayes filtering. In this article, we will derive the corresponding equations directly from the general Bayes filter. Furthermore, we will get to know a different way to think about the unscented transform. If you haven’t read the introduction, I would recommend reading it first. Before we dive into the derivation, let’s try to state the main idea behind the unscented Kalman filter.

The unscented Kalman filter approximates the Bayes filter by approximating the system and observation equations locally. A set of probe points is used to infer the local behavior of the function around the current estimate.

When I first learned about the unscented Kalman filter, it seemed that it was not derived from the general Bayes filter, but rather directly from the Kalman filter.

In this post, I want to explore how to derive the unscented Kalman filter directly from the general Bayes filter. But there is one catch: to obtain the equations of the unscented Kalman filter, you have to take some steps that seem hard to justify. It is entirely possible that there are deep thoughts behind them, but my limited insight prevents me from identifying them. Nonetheless, I will start by deriving a different version of the unscented Kalman filter that is in some regards more principled. In the end, I will adjust this version to match the standard unscented Kalman filter.

Let’s start with the derivation!

Derivation

First of all, if we want to apply the unscented Kalman filter, we assume that we have nonlinear models with additive noise
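Concretely, I read this as a state-transition and an observation model of the following form (the symbols \(f\), \(h\), \(Q\) and \(R\) are my own naming, chosen to match the sketches later in this article):

\[
x_{t+1} = f(x_t, u_t) + \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, Q), \qquad
y_t = h(x_t) + \delta_t, \quad \delta_t \sim \mathcal{N}(0, R).
\]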

Unlike the extended Kalman filter, which is based on linearization, the unscented Kalman filter is built around the unscented transform. In simplest terms, if you want to transform a Gaussian distribution

through a nonlinear function, it will give you an estimate of the resulting mean and covariance.
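To make this idea tangible, here is a minimal Python sketch of such a transform. The function name, the scaling parameters \(a_i\) and the weighting are my own choices and anticipate the sigma-point construction derived below; treat it as an illustration, not as the canonical formulation.

```python
import numpy as np

def unscented_transform(mu, Sigma, g, a=None):
    # Build symmetric sigma points from the Cholesky factor of Sigma,
    # push them through the nonlinearity g, and recombine them with weights.
    # g is expected to return a 1-D numpy array.
    n = mu.shape[0]
    a = np.full(n, np.sqrt(n)) if a is None else a       # free scaling parameters a_i
    L = np.linalg.cholesky(Sigma)
    chi = np.vstack([mu]
                    + [mu + a[i] * L[:, i] for i in range(n)]
                    + [mu - a[i] * L[:, i] for i in range(n)])
    w = np.concatenate(([1.0 - np.sum(1.0 / a**2)], np.tile(1.0 / (2.0 * a**2), 2)))
    Y = np.array([g(c) for c in chi])                     # transformed sigma points
    mean = w @ Y                                          # estimated mean of g(x)
    diff = Y - mean
    cov = (w[:, None] * diff).T @ diff                    # estimated covariance of g(x)
    return mean, cov

# Hypothetical usage: push a 2-D Gaussian through a mildly nonlinear map.
mu = np.array([1.0, 0.5])
Sigma = np.diag([0.1, 0.2])
mean, cov = unscented_transform(mu, Sigma, lambda x: np.array([np.sin(x[0]), x[0] * x[1]]))
```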

For the derivation of the equations of the unscented Kalman filter, I will state the unscented transform in a slightly different form, which I will call the joint probability interpretation of the unscented transform.

Joint probability interpretation of the unscented transform

In this section, we explore an interpretation of the unscented transform in terms of the joint probability. We are in a setting with a Gaussian conditional distribution with nonlinear mean and constant variance

and a Gaussian prior \(p(x) = \mathcal{N}(x|\mu, \Sigma) \).
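For concreteness, I will write this conditional distribution as follows; the nonlinear mean function \(g\) and the constant covariance \(\Sigma_y\) are my own placeholder names, used consistently in the sketches below:

\[
p(y|x) = \mathcal{N}\bigl(y \mid g(x), \Sigma_y\bigr).
\]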

Conceptually, we will first replace the prior with a surrogate distribution \(\hat{p}(x)\). Subsequently, we will approximate the joint distribution of the true conditional distribution \(p(y|x)\) and the surrogate distribution \(\hat{p}(x)\) with a Gaussian distribution \( \hat{p}(x,y) = \mathcal{N}(x,y|\hat{\mu},\hat{\Sigma}).\)

Surrogate distribution

The first step is to replace the Gaussian distribution \( p(x) \) with a surrogate distribution \(\hat{p}(x)\) that has the same mean and variance. For this surrogate distribution, we choose a mixture of weighted Dirac delta functions

where \(\delta_{\chi^i}(x)\) is a shorthand for \(\delta(x-\chi^i)\) and the weights sum to 1:
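In other words, the surrogate has the form (my reconstruction of the two statements above, with \(2n+1\) sigma points \(\chi^0, \ldots, \chi^{2n}\)):

\[
\hat{p}(x) = \sum_{i=0}^{2n} w^i \, \delta_{\chi^i}(x), \qquad \sum_{i=0}^{2n} w^i = 1.
\]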

The shift parameters \(\chi^i\) of the Dirac delta functions are called sigma points and are chosen to correspond to

where \(L_i\) describes the \(i\)th column of the Cholesky decomposition \(\Sigma = LL^T\) and \(n\) is the dimension of the multivariate Gaussian distribution \(p(x)\). The parameters \(a_i\) are arbitrary and can be chosen by the designer. Because of the symmetry of the sigma points, the weights will be symmetric as well
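Spelled out, the symmetric construction described here is (in my indexing, with one central point and \(n\) symmetric pairs):

\[
\chi^0 = \mu, \qquad \chi^{i} = \mu + a_i L_i, \qquad \chi^{n+i} = \mu - a_i L_i, \qquad i = 1,\ldots,n.
\]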

Let’s answer the question of how we have to choose the weights \(w^i\) to obtain the desired property

We start by plugging our sigma points \(\chi^i\) into the mean

and notice that this property is independent of the weights \(w^i\).

Let’s move to the variance and see how we have to choose the weights \(w^i\) to match

We start by plugging the corresponding sigma points \(\chi^i\) into the variance

We note that all occurrences of the mean cancel out. In particular, the first term, corresponding to the central sigma point \(\chi^0\), vanishes completely. Therefore, the corresponding weight plays no direct role for the mean and variance. By using the symmetry properties, we finally arrive at

We know that the variance can be expressed by

By a comparison of coefficients, we notice that the property

has to be satisfied. It follows that the weights should be \(w^i = \frac{1}{2a_i^2} \) for \(i=1,\ldots,n\).

Furthermore, since all weights have to sum to \(1\), it follows that the weight of the central sigma point \(\chi^0\) has to be
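With each of the \(n\) symmetric pairs contributing twice the weight \(\frac{1}{2a_i^2}\), the normalization constraint gives (my reconstruction):

\[
w^0 = 1 - \sum_{i=1}^{n} \frac{1}{a_i^2}.
\]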

We finally found our surrogate distribution

with the correct mean \(\mu\) and variance \(\Sigma\).
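Putting the pieces together, the surrogate should read (again my reconstruction from the derivation above):

\[
\hat{p}(x) = \Bigl(1 - \sum_{i=1}^{n}\frac{1}{a_i^2}\Bigr)\,\delta_{\mu}(x) + \sum_{i=1}^{n} \frac{1}{2a_i^2}\Bigl(\delta_{\mu + a_i L_i}(x) + \delta_{\mu - a_i L_i}(x)\Bigr).
\]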

Gaussian approximation of the joint probability

Let’s look at the second part: We want to approximate the joint distribution of the true conditional distribution \(p(y|x)\) and the surrogate distribution \(\hat{p}(x)\) with a Gaussian distribution \( \hat{p}(x,y) = \mathcal{N}(x,y|\hat{\mu},\hat{\Sigma}).\)

We can write the joint probability of \(p(y|x)\) and \(\hat{p}(x)\) as

This joint probability will look like weighted slices of the conditional probability \(p(y|x) \).

The next step is to approximate this weird looking distribution with a Gaussian distribution. But how should we choose the parameters of this Gaussian joint distribution? The answer is simple: we calculate the mean and the variance of the joint probability, which will define the mean and variance of our joint Gaussian distribution. Fair enough! So, let’s dive directly into the calculation of the mean. We start directly from the definition of the expectation

and plug in the corresponding distributions to obtain

We interchange the sum with the integrals

and use the sifting property of the Dirac delta function to obtain

We can simplify this expression even further to

We are done! To finish our approximation we have to calculate the covariance of the joint probability. Let’s start again from the definition of the variance

and plug in our distributions to obtain

Again, we rearrange the sum and the integrals

and use the sifting property to finally obtain

Now we also have a nice expression for the variance of the approximated joint probability \(\hat{p}(x,y)\).

Let’s summarize our result

Joint probability interpretation of the unscented transform

Given the Gaussian conditional distribution with nonlinear mean and constant variance

and a Gaussian prior \(p(x) = \mathcal{N}(x|\mu, \Sigma) \), we can approximate the joint probability \(p(x,y)\) by a Gaussian joint probability \(\hat{p}(x,y)\), where the prior \(p(x)\) is replaced by

where the weights \(w^i\) are defined by

and the sigma points \(\chi^i\) by

The resulting Gaussian approximation has the following form
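As far as I can reconstruct it from the calculation above, with \(\Sigma_y\) denoting the constant covariance of the conditional \(p(y|x)\), this approximation is:

\[
\hat{\mu} = \sum_{i} w^i \begin{pmatrix} \chi^i \\ g(\chi^i) \end{pmatrix}, \qquad
\hat{\Sigma} = \sum_{i} w^i \left( \begin{pmatrix} \chi^i \\ g(\chi^i) \end{pmatrix} - \hat{\mu} \right) \left( \begin{pmatrix} \chi^i \\ g(\chi^i) \end{pmatrix} - \hat{\mu} \right)^{\!T} + \begin{pmatrix} 0 & 0 \\ 0 & \Sigma_y \end{pmatrix}.
\]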

Please be aware that this is not the usual definition of the unscented transform. But it will be easier to work with this form for now and to identify the standard unscented transform as a special case later.

Ok, now we know how to approximate the joint probability of Gaussian distributions with nonlinear mean and a Gaussian prior. But what can we do with it?

If we look closely at the prediction step

and the update step

of the general Bayes filter, we can find a joint probability of the form we just discussed. For the prediction step we have

and for the update step

That is pretty nice! We can approximate joint probabilities, and the equations of the Bayes filter take joint probabilities as input. So let’s not waste time and start directly with the prediction step.

Prediction step

We want to calculate

But instead of using the true joint probability \(p(x_{t+1}|x_{t}, u_{t})p(x_{t}|y_{0:t},u_{0:t-1})\) we will use our approximation \(\hat{p}(x_{t+1},x_{t}|y_{0:t},u_{0:t})\)

Fortunately, marginalizing a joint Gaussian distribution is very simple: you just drop the entries of the mean and covariance that belong to the dimensions you are marginalizing out:

Therefore, we only have to calculate the mean and covariance of the remaining part. Our predicted state estimate will be a Gaussian distribution with parameters
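Written out with the sigma points \(\chi^i\) of the current posterior and the process-noise covariance (which I call \(Q\)), the prediction should read, as far as I can reconstruct it:

\[
\mu_{t+1|t} = \sum_{i} w^i f(\chi^i, u_t), \qquad
\Sigma_{t+1|t} = \sum_{i} w^i \bigl(f(\chi^i, u_t) - \mu_{t+1|t}\bigr)\bigl(f(\chi^i, u_t) - \mu_{t+1|t}\bigr)^{T} + Q.
\]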

Update step

Now we will look at the update step

and replace the true joint probability \(p(y_t|x_t)p(x_t|y_{0:t-1},u_{0:t-1})\) with our approximation \(\hat{p}(y_t,x_t|y_{0:t-1},u_{0:t-1})\) and obtain

Now we have two options: either we just simplify these equations to obtain the final update equation of the unscented Kalman filter, or we can be smart and note that we would compute nothing else but the update step of the Kalman filter. Let’s take the second option.

The joint probability in the linear Gaussian case (where we already have the equations of the Kalman filter) can be expressed as

The only thing we have to do is a coefficient comparison with the joint Gaussian of our estimate:

We identify the following correspondences:
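In block notation (the block names are mine), the comparison amounts to reading off the pieces the Kalman update needs; conditioning the approximated joint on the measurement \(y_t\), which is exactly what the Kalman update does, then gives

\[
\hat{\mu} = \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \quad
\hat{\Sigma} = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}
\quad\Longrightarrow\quad
\mu_{t|t} = \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}\,(y_t - \mu_y), \quad
\Sigma_{t|t} = \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}.
\]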

We are done and summarize our results as follows.

Equations of the unscented Kalman filter

The recursive formula for the unscented Kalman filter consists of the prediction step

and the update step
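To make the summary concrete, here is a minimal Python sketch of one prediction/update cycle along the lines derived above. All names (sigma_points, ukf_step, f, h, Q, R, a) are my own, and it uses a single weight vector for both the mean and the covariance, i.e. the version derived in this section rather than the standard UKF weighting introduced below.

```python
import numpy as np

def sigma_points(mu, Sigma, a):
    # Central point plus n symmetric pairs, with weights summing to one.
    n = mu.shape[0]
    L = np.linalg.cholesky(Sigma)
    chi = np.vstack([mu]
                    + [mu + a[i] * L[:, i] for i in range(n)]
                    + [mu - a[i] * L[:, i] for i in range(n)])
    w = np.concatenate(([1.0 - np.sum(1.0 / a**2)], np.tile(1.0 / (2.0 * a**2), 2)))
    return chi, w

def ukf_step(mu, Sigma, u, y, f, h, Q, R, a):
    # One cycle, assuming additive noise: x' = f(x, u) + eps (cov Q), y = h(x) + delta (cov R).
    # f and h are expected to return 1-D numpy arrays.
    # Prediction step: propagate the sigma points through f and add the process noise.
    chi, w = sigma_points(mu, Sigma, a)
    X = np.array([f(c, u) for c in chi])
    mu_p = w @ X
    Sigma_p = (w[:, None] * (X - mu_p)).T @ (X - mu_p) + Q

    # Update step: approximate the joint of state and measurement, then condition on y.
    chi, w = sigma_points(mu_p, Sigma_p, a)       # sigma points are re-drawn here
    Yhat = np.array([h(c) for c in chi])
    y_p = w @ Yhat
    S = (w[:, None] * (Yhat - y_p)).T @ (Yhat - y_p) + R    # innovation covariance
    C = (w[:, None] * (chi - mu_p)).T @ (Yhat - y_p)        # state-measurement cross-covariance
    K = C @ np.linalg.inv(S)                                # Kalman gain
    mu_new = mu_p + K @ (y - y_p)
    Sigma_new = Sigma_p - K @ S @ K.T
    return mu_new, Sigma_new
```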

Unscented Kalman filter

To obtain the equations of the unscented Kalman filter that are normally used, we have to adjust the result of our derivation in two aspects. The first aspect is that we choose specific values for the parameters \(a_i\), corresponding to

The weights for \(i=1,\ldots,2n \) are set to

For the center sigma point \(\chi^0\), we will use different weights \(w_s^0\) and \(w_c^0\) for the calculation of the mean and the variance, respectively, defined by

where the parameter \(\lambda\) is defined as

A typical recommendation is to use \(\kappa = 0\), \(\alpha = 10^{-3}\) and \(\beta=2\). We noted above that the covariance of our surrogate prior \(\hat{p}(x)\) does not depend on the weight \(w^0\). But please be aware that it will certainly impact the rest of the joint probability distribution.
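For reference, the commonly cited parameterization (due to Wan and van der Merwe; I am quoting it from memory, so the details may differ from the equations originally displayed here) is

\[
\lambda = \alpha^2 (n + \kappa) - n, \qquad a_i = \sqrt{n + \lambda}, \qquad
w^i = \frac{1}{2(n+\lambda)} \;\; (i=1,\ldots,2n), \qquad
w_s^0 = \frac{\lambda}{n+\lambda}, \qquad
w_c^0 = \frac{\lambda}{n+\lambda} + 1 - \alpha^2 + \beta.
\]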

The second aspect we have to change (or at least could change) is that we don’t have to recalculate our sigma points after the prediction step: we can simply reuse the transformed sigma points \(f(\chi^i)\) as our new sigma points.

Enough of the dry theory! Let’s play around with the unscented Kalman filter in our race track example.

Example


On the outside of the race track, you will notice a blue-colored strip. This strip represents our current posterior over the position of the race car. We start with a Gaussian prior distribution around the true mean. By pressing the OBSERVE button, two things will happen: first, we take a measurement of the distance to the tree, and second, we display the likelihood for this observed distance on the brown strip inside the race track. By pressing the UPDATE STEP button, we perform our update step and show the resulting posterior on the outer strip. Now we are ready for the next time step. Take an action by pressing the corresponding button below the race track. After the step is performed, you have to update your posterior by pressing the PREDICT STEP button. You will see that the outer strip changes accordingly. Now we have finished one full cycle of the filtering process and are ready to start a new cycle by taking a measurement.

What if our distance meter stops working? By either clicking on the tree or pressing the W key on your keyboard, you can turn off your measurement device. The observation and update steps will then be skipped. The tree will become opaque if your measurement device is turned off.

If you want to reset the environment, just press the reset button in the bottom left corner or press the R key on your keyboard. As before, you can control the car using your keyboard: A (Backward), S (Stop), D (Forward), or the buttons below the race track.

If you still haven’t had enough of nonlinear filtering, you should check out the next article in this series about the derivation of the particle filter.

Acknowledgement

The vector graphics of the car were created by Freepik.