Nonlinear filtering: Unscented Kalman filter
The unscented Kalman filter is another method for approximating nonlinear Bayes filtering. In this article, we will derive the corresponding equations directly from the general Bayes filter. Along the way, we will get to know a different way of thinking about the unscented transform. If you haven’t read the introduction, I recommend reading it first. Before we dive into the derivation, let’s state the main idea behind the unscented Kalman filter.
The unscented Kalman filter approximates the Bayes filter by approximating the system and observation equations locally. A set of probe points is used to infer the local behavior of the function around the current estimate.
When I first learned about the unscented Kalman filter, it seemed that it was not derived from the general Bayes filter but rather directly from the Kalman filter.
In this post, I want to explore how to derive the unscented Kalman filter directly from the general Bayes filter. There is one catch, though: to obtain the equations of the unscented Kalman filter, you have to take some steps that seem hard to justify. It is entirely possible that there are deep thoughts behind them, but my limited scope prevents me from identifying them. Nonetheless, I will first derive a different version of the unscented Kalman filter that is in some regards more principled. In the end, I will adjust this version to match the standard unscented Kalman filter.
Let’s start with the derivation!
Derivation
First of all, if we want to apply the unscented Kalman filter, we assume that we have nonlinear models with additive noise
Unlike the extended Kalman filter, which is based on linearization, the unscented Kalman filter is built around the unscented transform. In simplest terms, if you want to transform a Gaussian distribution through a nonlinear function, the unscented transform gives you an estimate of the resulting mean and covariance.
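To make this concrete, here is a minimal sketch in Python/NumPy of what the unscented transform provides (the function name, the scaling parameter \(\kappa\), and the example nonlinearity are hypothetical choices of mine, not part of the original text): given a Gaussian and a nonlinear function, it returns estimates of the transformed mean and covariance.

```python
import numpy as np

def unscented_estimate(mu, Sigma, f, kappa=0.0):
    """Estimate mean/covariance of f(x) for x ~ N(mu, Sigma) via sigma points."""
    n = mu.shape[0]
    L = np.linalg.cholesky((n + kappa) * Sigma)
    # 2n+1 sigma points: the mean itself plus symmetric offsets along the Cholesky columns
    points = [mu] + [mu + L[:, i] for i in range(n)] + [mu - L[:, i] for i in range(n)]
    weights = np.array([kappa / (n + kappa)] + [1.0 / (2 * (n + kappa))] * (2 * n))
    ys = np.array([f(p) for p in points])
    mean = weights @ ys
    cov = sum(w * np.outer(y - mean, y - mean) for w, y in zip(weights, ys))
    return mean, cov

# Example: push a 2D Gaussian through a polar-to-Cartesian transform
mu = np.array([1.0, np.pi / 4])
Sigma = np.diag([0.1, 0.05])
f = lambda x: np.array([x[0] * np.cos(x[1]), x[0] * np.sin(x[1])])
print(unscented_estimate(mu, Sigma, f, kappa=1.0))
```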
For the derivation of the equations of unscented Kalman filter, I will state the unscented transform in a slightly different form, which I will call the joint probability interpretation of the unscented transform.
Joint probability interpretation of the unscented transform
In this section, we explore an interpretation of the unscented transform in terms of the joint probability. We are in a setting with a Gaussian conditional distribution with nonlinear mean and constant variance
and a Gaussian prior \(p(x) = \mathcal{N}(x|\mu, \Sigma) \).
Conceptually, we will first replace the prior with a surrogate distribution \(\hat{p}(x)\). Subsequently, we will approximate the joint distribution of the true conditional distribution \(p(y|x)\) and the surrogate distribution \(\hat{p}(x)\) with a Gaussian distribution \( \hat{p}(x,y) = \mathcal{N}(x,y|\hat{\mu},\hat{\Sigma}).\)
Surrogate distribution
The first step is to replace the Gaussian distribution \( p(x) \) with a surrogate distribution \(\hat{p}(x)\) that has the same mean and variance. For this surrogate distribution, we choose a mix of weighted Dirac delta functions
where \(\delta_{\chi^i}(x)\) is a shorthand for \(\delta(x-\chi^i)\) and the weights sum to 1:
The shift parameters \(\chi^i\) of the Dirac delta functions are called sigma points and are chosen as
where \(L_i\) describes the \(i\)th column of the Cholesky decomposition \(\Sigma = LL^T\) and \(n\) is the dimension of the multivariate Gaussian distribution \(p(x)\). The parameters \(a_i\) are arbitrary and can be chosen by the designer. Because of the symmetry of the sigma points, the weights will be symmetric as well
Let’s answer the question of how we have to choose the weights \(w^i\) to obtain the desired property
We start by plugging our sigma points \(\chi^i\) into the mean
and notice that this property is independent of the weights \(w^i\).
Let’s move to the variance and see how we have to choose the weights \(w^i\) to match
We start by plugging the corresponding sigma points \(\chi^i\) into the variance
We note that all occurrences of the mean cancel out. In particular, the first term, corresponding to the central sigma point \(\chi^0\), vanishes completely. Therefore, the corresponding weight plays no direct role for the mean and variance. Using the symmetry properties, we finally arrive at
We know that the variance can be expressed as
By a comparison of coefficients, we notice that the property
has to be satisfied. It follows that the weights should be \(w^i = \frac{1}{2a_i^2} \) for \(i=1,\ldots,n\).
Furthermore, since all weights have to sum to \(1\), it follows that the weight of the central sigma point \(\chi^0\) has to be
We finally found our surrogate distribution
with the correct mean \(\mu\) and variance \(\Sigma\).
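As a sanity check, the following sketch (Python/NumPy; the helper name is hypothetical) constructs the sigma points \(\chi^i = \mu \pm a_i L_i\) with weights \(w^i = \frac{1}{2a_i^2}\), fills in the central weight so that all weights sum to one, and verifies that the surrogate distribution indeed reproduces \(\mu\) and \(\Sigma\).

```python
import numpy as np

def surrogate_sigma_points(mu, Sigma, a):
    """Sigma points chi^i = mu +/- a_i * L_i with weights w^i = 1/(2 a_i^2)."""
    n = mu.shape[0]
    L = np.linalg.cholesky(Sigma)
    chis = [mu]
    weights = []
    for i in range(n):
        chis.append(mu + a[i] * L[:, i])
        chis.append(mu - a[i] * L[:, i])
        weights += [1.0 / (2 * a[i] ** 2)] * 2
    w0 = 1.0 - sum(weights)  # central weight chosen so that all weights sum to 1
    return np.array(chis), np.array([w0] + weights)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
chis, w = surrogate_sigma_points(mu, Sigma, a=[1.5, 0.8])

mean = w @ chis                                                          # reproduces mu
cov = sum(wi * np.outer(c - mean, c - mean) for wi, c in zip(w, chis))   # reproduces Sigma
print(np.allclose(mean, mu), np.allclose(cov, Sigma))
```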
Gaussian approximation of the joint probability
Let’s look at the second part: We want to approximate the joint distribution of the true conditional distribution \(p(y|x)\) and the surrogate distribution \(\hat{p}(x)\) with a Gaussian distribution \( \hat{p}(x,y) = \mathcal{N}(x,y|\hat{\mu},\hat{\Sigma}).\)
We can write the joint probability of \(p(y|x)\) and \(\hat{p}(x)\) as
This joint probability will look like weighted slices of the conditional probability \(p(y|x) \).
The next step is to approximate this weird-looking distribution with a Gaussian distribution. But how should we choose the parameters of this Gaussian joint distribution? The answer is simple: we calculate the mean and the variance of the joint probability, which will define the mean and variance of our joint Gaussian distribution. Fair enough! So let’s dive directly into the calculation of the mean. We start directly from the definition of the expectation
and plug in the corresponding distributions to obtain
We interchange the sum with the integrals
and use the sifting property of the Dirac delta function to obtain
We can simplify this expression even further to
We are done! To finish our approximation, we have to calculate the covariance of the joint probability. Let’s start again from the definition of the covariance
and plug in our distributions to obtain
Again, we interchange the sum with the integrals
and use the sifting property to finally obtain
Now we also have a nice expression for the covariance of the approximated joint probability \(\hat{p}(x,y)\).
Let’s summarize our result
Joint probability interpretation of the unscented transform
Given the Gaussian conditional distribution with nonlinear mean and constant variance
and a Gaussian prior \(p(x) = \mathcal{N}(x|\mu, \Sigma) \), we can approximate the joint probability \(p(x,y)\) by a Gaussian joint probability \(\hat{p}(x,y)\), where the prior \(p(x)\) is replaced by
where the weights \(w^i\) are defined by
and the sigma points \(\chi^i\) by
The resulting Gaussian approximation has the following form
Please be aware that this is not the usual definition of the unscented transform. But it will be easier to work with this form for now and to identify the normal unscented transform as a special case later.
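To illustrate the joint probability interpretation numerically, here is a small sketch (Python/NumPy; the helper name, the symbol \(g\) for the nonlinear conditional mean, the constant conditional covariance \(R\), and the example nonlinearity are my own hypothetical choices): it assembles the joint mean and covariance of \((x, y)\) from the sigma points of the prior, their images under \(g\), and \(R\).

```python
import numpy as np

def joint_approximation(mu, Sigma, g, R, a):
    """Gaussian approximation of p(x, y) = p(y|x) p_hat(x) with p(y|x) = N(y | g(x), R)."""
    n = mu.shape[0]
    L = np.linalg.cholesky(Sigma)
    chis = [mu] + [mu + a[i] * L[:, i] for i in range(n)] + [mu - a[i] * L[:, i] for i in range(n)]
    w = [1.0 / (2 * a[i] ** 2) for i in range(n)] * 2
    w = np.array([1.0 - sum(w)] + w)

    # Stack each sigma point with its image under g to get points in the joint (x, y) space
    joint_pts = np.array([np.concatenate([chi, g(chi)]) for chi in chis])
    joint_mu = w @ joint_pts
    joint_Sigma = sum(wi * np.outer(p - joint_mu, p - joint_mu) for wi, p in zip(w, joint_pts))
    joint_Sigma[n:, n:] += R   # the constant conditional covariance enters only the y-block
    return joint_mu, joint_Sigma

# Tiny usage example with a scalar observation of a 2D state
mu, Sigma = np.zeros(2), np.eye(2)
g = lambda x: np.array([np.sin(x[0]) + x[1] ** 2])
print(joint_approximation(mu, Sigma, g, R=np.array([[0.1]]), a=[1.0, 1.0]))
```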
Ok, now we know how to approximate the joint probability of Gaussian distributions with nonlinear mean and a Gaussian prior. But what can we do with it?
If we look closely at the prediction step
and the update step
of the general Bayes filter, we can find the joint probability of the form we just discussed. For the prediction step we have
and for the update step
That is pretty nice! We can approximate joint probabilities, and the equations of the Bayes filter take joint probabilities as input. So let’s not waste time and start directly with the prediction step.
Prediction step
We want to calculate
But instead of using the true joint probability \(p(x_{t+1}|x_{t}, u_{t})p(x_{t}|y_{0:t},u_{0:t-1})\) we will use our approximation \(\hat{p}(x_{t+1},x_{t}|y_{0:t},u_{0:t})\)
Fortunately, marginalizing over a joint Gaussian distribution is very simple. You just have to discard the dimensions you are marginalizing over:
Therefore, we only have to calculate the mean and covariance of the remaining part. Our predicted state estimate will be a Gaussian distribution with parameters
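As a rough illustration, a minimal sketch of this prediction step could look as follows (Python/NumPy; the helper name, the transition function \(f\), and the additive process noise covariance \(Q\) are hypothetical placeholders): propagate the sigma points through \(f\), then read off the mean and covariance of the marginal over the next state.

```python
import numpy as np

def ukf_predict(mu, Sigma, f, Q, a):
    """Prediction step: marginalize the approximated joint Gaussian over the current state."""
    n = mu.shape[0]
    L = np.linalg.cholesky(Sigma)
    chis = [mu] + [mu + a[i] * L[:, i] for i in range(n)] + [mu - a[i] * L[:, i] for i in range(n)]
    w = [1.0 / (2 * a[i] ** 2) for i in range(n)] * 2
    w = np.array([1.0 - sum(w)] + w)

    fx = np.array([f(chi) for chi in chis])   # transformed sigma points f(chi^i)
    mu_pred = w @ fx                          # predicted mean
    Sigma_pred = sum(wi * np.outer(p - mu_pred, p - mu_pred) for wi, p in zip(w, fx)) + Q
    return mu_pred, Sigma_pred
```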
Update step
Now we will look at the update step
and replace the true joint probability \(p(y_t|x_t)p(x_t|y_{0:t-1},u_{0:t-1})\) with our approximation \(\hat{p}(y_t,x_t|y_{0:t-1},u_{0:t-1})\) and obtain
Now we have two options: either we just simplify these equations to obtain the final update equation of the unscented Kalman filter, or we can be smart and note that we would be computing nothing other than the update step of the Kalman filter. Let’s take the second option.
The joint probability in the linear Gaussian case (where we already have the equations of the Kalman filter) can be expressed as
The only thing we have to do is a coefficient comparison with the joint Gaussian of our estimate:
We identify the following correspondences:
We are done! Below is a quick sketch of the resulting update step, and then we summarize our results.
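This is a minimal sketch of the update step written in Kalman-gain form (Python/NumPy; the helper name, the measurement function \(h\), and the measurement noise covariance \(R\) are hypothetical placeholders): form the innovation and cross-covariances from the sigma points, build the Kalman gain, and apply the familiar update equations.

```python
import numpy as np

def ukf_update(mu, Sigma, y, h, R, a):
    """Update step in Kalman-gain form, using sigma points of the predicted state."""
    n = mu.shape[0]
    L = np.linalg.cholesky(Sigma)
    chis = [mu] + [mu + a[i] * L[:, i] for i in range(n)] + [mu - a[i] * L[:, i] for i in range(n)]
    w = [1.0 / (2 * a[i] ** 2) for i in range(n)] * 2
    w = np.array([1.0 - sum(w)] + w)

    hx = np.array([h(chi) for chi in chis])
    y_pred = w @ hx                                                               # predicted measurement
    S = sum(wi * np.outer(p - y_pred, p - y_pred) for wi, p in zip(w, hx)) + R    # innovation covariance
    C = sum(wi * np.outer(c - mu, p - y_pred) for wi, c, p in zip(w, chis, hx))   # cross-covariance
    K = C @ np.linalg.inv(S)                                                      # Kalman gain
    return mu + K @ (y - y_pred), Sigma - K @ S @ K.T
```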
Equations of the unscented Kalman filter
The recursive formula for the unscented Kalman filter consists of the prediction step
and the update step
Unscented Kalman filter
To obtain the equations of the unscented Kalman filter that are normally used, we have to adjust the result of our derivation in two aspects. The first aspect is that we choose specific values for the parameters \(a_i\), namely
The weights for \(i=1,\ldots,2n \) are set to
For the central sigma point \(\chi^0\), we will use different weights \(w_s^0\) and \(w_c^0\) for the calculation of the mean and the covariance, respectively, defined by
where the parameter \(\lambda\) is defined as
A typical recommendation is to use \(\kappa = 0\), \(\alpha = 10^{-3}\), and \(\beta=2\). We noted above that the covariance of our surrogate prior \(\hat{p}(x)\) does not depend on the weight \(w^0\). But please be aware that it will certainly impact the rest of the joint probability distribution.
The second aspect we have to change (or at least could change) is that we don’t have to recalculate our sigma points after the prediction step. We can simply reuse the transformed sigma points \(f(\chi^i)\) as our new sigma points.
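For reference, here is a small sketch of the commonly used scaled parameterization of these weights (Python/NumPy; the helper name is hypothetical, and I am assuming the standard formulas \(\lambda = \alpha^2(n+\kappa) - n\), \(a_i = \sqrt{n+\lambda}\), \(w^i = \frac{1}{2(n+\lambda)}\), which I believe match the definitions above).

```python
import numpy as np

def scaled_ukf_weights(n, alpha=1e-3, beta=2.0, kappa=0.0):
    """Sigma-point scaling and weights of the commonly used scaled unscented transform."""
    lam = alpha ** 2 * (n + kappa) - n                       # lambda = alpha^2 (n + kappa) - n
    a = np.sqrt(n + lam)                                     # a_i = sqrt(n + lambda) for all i
    w_mean = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))
    w_cov = w_mean.copy()
    w_mean[0] = lam / (n + lam)                              # central weight w_s^0 (mean)
    w_cov[0] = lam / (n + lam) + (1 - alpha ** 2 + beta)     # central weight w_c^0 (covariance)
    return a, w_mean, w_cov

print(scaled_ukf_weights(n=2))
```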
Enough of the dry theory! Let’s play around with the unscented Kalman filter in our race track example.
Example
On the outside of the race track, you will notice a blue-colored strip. This strip represents our current posterior over the position of the race car. We will start with a Gaussian prior distribution around the true mean. By pressing the OBSERVE button, two things will happen: first, we take a measurement of the distance to the tree, and second, we display the likelihood of this observed distance on the brown strip inside the race track. By pressing the UPDATE STEP button, we perform our update step and show the resulting posterior on the outer strip. Now we are ready for the next time step. Take an action by pressing the corresponding button below the race track. After the step is performed, you have to update your posterior by pressing the PREDICT STEP button. You will see that the outer strip changes accordingly. Now we have finished one full cycle of the filtering process and are ready to start a new cycle by taking a measurement.
What if our distance meter stops working? By either clicking on the tree or pressing the W button on your keyboard, you can turn off your measurement device. The observation and update steps will then be skipped. The tree will become opaque if your measurement device is turned off.
If you want to reset the environment, just press the reset button in the bottom left corner or press the R button on your keyboard. As before you can control the car by using your keyboard: A (Backward), S (Stop), D (Forward) or the buttons below the race track.
If you still haven’t had enough of nonlinear filtering, you should check out the next article in this series about the derivation of the particle filter.