LINEAR ESTIMATION
AND THE KALMAN FILTER
by Rick Martinelli, Haiku Laboratories, June 2008
Copyright © 2008
The purpose of this paper is to develop the equations of the linear Kalman filter for use in data analysis. Our primary interest is the smoothing and prediction of financial market data, and the Kalman filter is one of the most powerful tools for this purpose. It is a recursive algorithm that predicts future values of an item based on the information in all of its past values, and it is also a least-squares procedure in that its prediction-error variance is minimized at each stage. Development of the equations proceeds in steps, from ordinary least-squares estimation through the Gauss-Markov estimate, minimum-variance estimation and recursive estimation to, finally, the Kalman filter. Much of this development is taken from references [1] and [2].
We begin with the simplest linear model y = Ax, where y is an n × 1 vector of measurements y(k), x is an m × 1 vector of unknown parameter values x(j), and A is a known n × m model matrix. If n = m and A is invertible, the solution is given by x = A^-1 y.
For least-squares we assume the model is over-determined, i.e., there are more measurements than parameters to be determined, so that n > m. Then A is not square and does not have an inverse. Instead we seek the least-squares estimate x̂ of x as the minimizer of the length of the residual vector e = y – Ax. The squared length of e is the sum of the squares of the n separate residuals, and is denoted
|e|^2 = (y – Ax)^T (y – Ax),
where A^T is the transpose of A. Minimizing this quadratic with respect to x yields the normal equations
A^T A x̂ = A^T y.
The matrix A^T A is symmetric and, provided the columns of A are independent, positive definite, so it has an inverse. Hence we can write
x̂ = (A^T A)^-1 A^T y.
This is the OLS estimate of the parameter vector x. If we define the m × n matrix L_O = (A^T A)^-1 A^T, then we have the simple expression
x̂ = L_O y.
From y = Ax we find the OLS estimate of y to be
ŷ = A L_O y = A (A^T A)^-1 A^T y = S y.
This result has a nice geometric interpretation: the n × n matrix S is a projection matrix and ŷ is the orthogonal projection of y onto the subspace generated by the columns of A. Projections also yield very effective methods for smoothing data (see [3]).
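As a concrete illustration, here is a minimal NumPy sketch of the OLS estimate and the projection matrix S. The model matrix A, the noise level and all variable names are invented for the example.

    import numpy as np

    # Over-determined model: n = 5 measurements, m = 2 parameters.
    rng = np.random.default_rng(0)
    A = rng.normal(size=(5, 2))                 # known model matrix
    x_true = np.array([1.5, -0.7])              # "unknown" parameter vector
    y = A @ x_true + 0.1 * rng.normal(size=5)   # noisy measurements

    # OLS estimate from the normal equations: (A^T A) x_hat = A^T y
    x_hat = np.linalg.solve(A.T @ A, A.T @ y)

    # Projection matrix S = A (A^T A)^-1 A^T and the OLS estimate of y
    S = A @ np.linalg.inv(A.T @ A) @ A.T
    y_hat = S @ y

    print(x_hat)                  # close to x_true
    print(np.allclose(S @ S, S))  # S is a projection: S^2 = S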
We next consider the notion of weighted least-squares (WLS), where weights are assigned to residuals according to the accuracy of the corresponding measurement. For example, if y(1) is considered more accurate than y(2), we would like to make e(1) smaller than e(2). Thus we replace the sum of the squares of the residuals |e|^2 with a weighted sum
|We|^2 = (Wy – WAx)^T (Wy – WAx),
where W is an n × n matrix of weights. If the measurements are independent, W is diagonal; if the measurements are correlated there will be off-diagonal elements. In all cases, W is symmetric and positive definite. In the case of WLS, the normal equations become
(WA)^T W A x̂ = (WA)^T W y,
and again we can take the inverse and write
x̂ = ((WA)^T W A)^-1 (WA)^T W y.
Since (WA)^T = A^T W^T, this is
x̂ = (A^T W^T W A)^-1 A^T W^T W y.
Letting C = W^T W, we have finally the WLS estimate L_W:
x̂ = L_W y = (A^T C A)^-1 A^T C y.
Note that OLS corresponds to a weight matrix W that is the identity matrix, that is, all measurement weights are equal to one, so that L_W reduces to L_O.
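A minimal sketch of the WLS estimate for the common case of independent measurements with known (here invented) noise levels; W, C and the other names follow the notation above.

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.normal(size=(6, 2))
    x_true = np.array([2.0, 0.5])
    sigma = np.array([0.1, 0.1, 0.5, 0.5, 1.0, 1.0])   # per-measurement noise levels
    y = A @ x_true + sigma * rng.normal(size=6)

    # Diagonal weight matrix: more accurate measurements receive larger weights.
    W = np.diag(1.0 / sigma)
    C = W.T @ W

    # WLS estimate: x_hat = (A^T C A)^-1 A^T C y
    x_hat = np.linalg.solve(A.T @ C @ A, A.T @ C @ y)
    print(x_hat)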
MINIMUM VARIANCE UNBIASED ESTIMATOR (MVU)
To determine the optimum choice for the matrix C in the
WLS estimate we need to add some statistical properties to the model. We assume the measurement vector y, and hence
the residual vector e = y – Ax, are random vectors. One required statistical property is that the residuals are unbiased, that is, scattered equally about zero. This is expressed by E[e(k)] = 0 for each k, where E is the expected value operator.
Each residual also has a variance, denoted V_k = E[e(k)^2], and any correlations between residuals show up as covariances V_km = V_mk = E[e(k)e(m)]. These are collected in the covariance matrix of e, defined by
R = E[ee^T].
The n × n matrix R has the variances V_k on its diagonal and the covariances V_km = V_mk as off-diagonal elements, and is positive-semidefinite. Matrix R will be used to calculate C, after defining another required statistical property.
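As a quick illustration of R = E[ee^T], the sketch below draws many residual vectors from an invented covariance matrix and checks that their sample covariance comes out close to it.

    import numpy as np

    rng = np.random.default_rng(2)
    # An invented residual covariance: variances on the diagonal, a covariance off it.
    R = np.array([[0.25, 0.05],
                  [0.05, 1.00]])
    # Draw many residual vectors e with E[e] = 0 and E[e e^T] = R.
    e = rng.multivariate_normal(mean=np.zeros(2), cov=R, size=100_000)
    R_sample = e.T @ e / len(e)       # sample estimate of E[e e^T]
    print(np.round(R_sample, 3))      # close to R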
The second required property is that the linear estimator L is also unbiased, in the sense that the expected error in the estimate, x – x̂, is zero. Although the true x values are unknown, we may still write
0 = E[x – x̂] = E[x – Ly] = E[x – LAx – Le] = E[(I – LA)x].
Hence, L is unbiased if LA = I, the m × m identity matrix. Here we've used the fact that
E[Le] = L E[e] = 0,
since both E and L are linear. The estimate errors also have a covariance matrix, given by
P = E[(x̂ – x)(x̂ – x)^T].
P is m × m and positive-semidefinite. Matrix P is called the fundamental matrix of filtering theory. The minimization of the trace of this matrix provides the optimum choice for C mentioned above. The trace of P may be written
E|x̂ – x|^2 = E[(x̂ – x)^T (x̂ – x)].
It is this quantity we seek to minimize for MV estimation. The solution to this optimization problem was found long ago to be C = R^-1, and the result is known as the Gauss-Markov theorem:
Given y = Ax + e, with E[ee^T] = R, the minimum-variance unbiased (MVU) estimator of x given y is
x̂ = L_G y = (A^T R^-1 A)^-1 A^T R^-1 y
with error covariance
P = (A^T R^-1 A)^-1.
It is easily seen that L_G A = I, showing that L_G is unbiased. The general WLS estimator
L_W = (A^T C A)^-1 A^T C
is also unbiased, since L_W A = I again. This includes OLS, where C = R = I, meaning that in OLS the residuals are tacitly assumed to be uncorrelated and to have variance 1.
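A minimal NumPy sketch of the Gauss-Markov (MVU) estimate for correlated residuals with a known covariance R; the matrices and noise values are invented for illustration.

    import numpy as np

    rng = np.random.default_rng(3)
    A = rng.normal(size=(4, 2))
    x_true = np.array([1.0, -2.0])
    # Known residual covariance R (positive definite, with correlated residuals).
    R = np.array([[0.50, 0.10, 0.00, 0.00],
                  [0.10, 0.50, 0.00, 0.00],
                  [0.00, 0.00, 2.00, 0.30],
                  [0.00, 0.00, 0.30, 2.00]])
    y = A @ x_true + rng.multivariate_normal(np.zeros(4), R)

    # MVU estimate: x_hat = (A^T R^-1 A)^-1 A^T R^-1 y, with error covariance P.
    Rinv = np.linalg.inv(R)
    P = np.linalg.inv(A.T @ Rinv @ A)
    x_hat = P @ A.T @ Rinv @ y
    print(x_hat)          # close to x_true
    print(np.diag(P))     # error variances of the two parameter estimates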
MINIMUM VARIANCE ESTIMATES (MV)
The previous estimates have assumed that x is a completely unknown parameter vector, capable of assuming any value. In many situations, however, there is prior statistical information about x which may limit its possible values. This a priori information may be exploited to produce an estimate, the minimum-variance (MV) estimate, with smaller error variance than the MVU estimate.
We adopt the model y = Ax + e, where now x, y and e are all random vectors, and A is a known, constant matrix. As before we assume E[ee^T] = R. We now assume in addition that E[xx^T] = Q, where Q is positive-semidefinite, and that x and e are uncorrelated, i.e., E[ex^T] = 0, the zero matrix.
Under these conditions we have the Minimum-Variance Estimation theorem:
Given y = Ax + e, with E[xx^T] = Q, E[ee^T] = R and E[ex^T] = 0, the linear estimate that minimizes the error variance E|x̂ – x|^2 is
x̂ = L_M y = (A^T R^-1 A + Q^-1)^-1 A^T R^-1 y
with error covariance
P = (A^T R^-1 A + Q^-1)^-1.
L_M is the linear minimum-variance estimator of x given y. Notice that if Q^-1 = 0, corresponding to infinite variance of the a priori information, L_M reduces to L_G, the MVU estimator.
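The sketch below compares the MV estimate with the MVU estimate when prior information E[xx^T] = Q is available; Q, R and A are invented, and the trace of the error covariance is used as the comparison.

    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.normal(size=(3, 2))
    Q = np.diag([0.5, 0.5])                            # prior covariance of x
    R = 0.2 * np.eye(3)                                # residual covariance
    x = rng.multivariate_normal(np.zeros(2), Q)        # x is now a random vector
    y = A @ x + rng.multivariate_normal(np.zeros(3), R)

    Rinv, Qinv = np.linalg.inv(R), np.linalg.inv(Q)
    # MV estimate: x_hat = (A^T R^-1 A + Q^-1)^-1 A^T R^-1 y
    P_mv = np.linalg.inv(A.T @ Rinv @ A + Qinv)
    x_mv = P_mv @ A.T @ Rinv @ y
    # MVU (Gauss-Markov) estimate ignores the prior, i.e., Q^-1 = 0.
    P_mvu = np.linalg.inv(A.T @ Rinv @ A)
    x_mvu = P_mvu @ A.T @ Rinv @ y
    print(np.trace(P_mv) <= np.trace(P_mvu))   # MV error variance is never larger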
Suppose that an MV estimate x̂_0 of x has been found based on previous measurements and has error covariance P_0. When new measurements become available we want an update x̂_1 of x̂_0 based on the new information in the measurements. Assume the new measurements are in the form y = Ax + e used above. The best estimate of the new measurements y, based on the past measurements only, is ŷ = A x̂_0. The error in this estimate, y – ŷ, is called the innovation. It represents the part of the information in y that is orthogonal to A x̂_0, i.e., could not be derived from x̂_0 by a linear operation. The estimate x̂_1 is given by the Recursive Estimation theorem:
If x̂_0 is the MV estimate of x based on some previous measurements and has error covariance P_0, and if new measurements y become available, the updated MV estimate x̂_1 of x based on all the information is given by
x̂_1 = x̂_0 + P_0 A^T (A P_0 A^T + R)^-1 (y – A x̂_0) = x̂_0 + L_M (y – A x̂_0)
with error covariance
P_1 = P_0 – P_0 A^T (A P_0 A^T + R)^-1 A P_0.
Notice that the old estimate x̂_0 is updated by the action of L_M (here the MV estimator above with Q replaced by P_0) on the innovation y – A x̂_0, and that the error covariance decreases with the addition of the new information in y.
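A sketch of one recursive update step under these assumptions; the prior estimate, its covariance and the new measurements are invented stand-in values.

    import numpy as np

    rng = np.random.default_rng(5)
    # Prior MV estimate x0 with error covariance P0 (invented values).
    x0 = np.array([0.8, -1.1])
    P0 = np.diag([0.4, 0.9])
    # New measurements assumed to follow y = A x + e with E[e e^T] = R.
    A = rng.normal(size=(3, 2))
    R = 0.1 * np.eye(3)
    y = rng.normal(size=3)                    # stand-in for the new measurement vector

    # Gain, updated estimate and updated error covariance.
    G = P0 @ A.T @ np.linalg.inv(A @ P0 @ A.T + R)
    x1 = x0 + G @ (y - A @ x0)                # gain acting on the innovation
    P1 = P0 - G @ A @ P0
    print(x1)
    print(np.trace(P1) <= np.trace(P0))       # error covariance does not increase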
The recursive MV estimate described above assumes the parameter vector x is random with a fixed, but unknown, mean value. However, many situations arise in which the parameter vector changes with time, e.g., the position of a moving vehicle, or the condition of the stock market. Thus we speak of the state of the parameter vector, or simply the state vector for the system, and indicate its value at time k by x_k. In addition we assume the state vector is updated by the state model, consisting of a linear update plus model noise:
x_{k+1} = F x_k + u_k,
where F is a known m × m matrix and the sequence u_k is the model noise. The state is observed through measurements of the form
y_k = A x_k + e_k,
where we assume E[e_k] = 0 and E[e_k e_k^T] = R_k for each k, with R_k positive-definite. The sequence e_k is called measurement noise. Taken together these two equations and their noise statistics constitute a discrete dynamical system.
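A minimal simulation of such a discrete dynamical system; F, A, Q and R (held constant in k here) and the initial state are invented for the example.

    import numpy as np

    rng = np.random.default_rng(6)
    F = np.array([[1.0, 1.0],
                  [0.0, 1.0]])          # state transition matrix (position/velocity style)
    A = np.array([[1.0, 0.0]])          # only the first state component is measured
    Q = 0.01 * np.eye(2)                # model-noise covariance
    R = np.array([[0.25]])              # measurement-noise covariance

    x = np.array([0.0, 0.1])            # initial state
    states, measurements = [], []
    for k in range(50):
        x = F @ x + rng.multivariate_normal(np.zeros(2), Q)   # x_{k+1} = F x_k + u_k
        y = A @ x + rng.multivariate_normal(np.zeros(1), R)   # y_k = A x_k + e_k
        states.append(x)
        measurements.append(y)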
A discrete random process is a sequence of random variables; the sequences y_k, e_k, x_k and u_k above are all examples. Three problems commonly associated with a random process are:
1. prediction – estimation of future values of the process from
previous measurements
2. filtering - estimation of the current value of the process from
previous inexact measurements up to the present
3. smoothing - estimation of past values of the process from previous
inexact measurements up to the present
This article deals with the first two of these; for
smoothing applications see [3]. The Kalman filter provides a linear,
minimum-variance, recursive estimation procedure based on ideas in the previous
sections. The model for the Kalman
filter is a discrete dynamical system
x_k = F x_{k-1} + u_{k-1}
y_k = A x_k + e_k
where F and A are known matrices, E[e_k] = 0, E[e_k e_k^T] = R_k, E[u_k] = 0 and E[u_k u_k^T] = Q_k, for k = 0, 1, 2, …. Assume also an initial estimate x̂_0 of x_0 with error covariance P_0 = E[(x̂_0 – x_0)(x̂_0 – x_0)^T]. The notation x̂_{k|k-1} will indicate the estimate of x_k from observations up to time k–1, and similarly for P_{k|k-1}. The Kalman filter theorem is:
The linear MV estimate x̂_{k|k-1} of the state vector x_k based on the previous k–1 measurements is updated according to the filter extrapolation equations:
x̂_{k|k-1} = F x̂_{k-1|k-1}
P_{k|k-1} = F P_{k-1|k-1} F^T + Q_{k-1}.
Initial conditions are x̂_{0|-1} = x̂_0 and P_{0|-1} = P_0.
In addition to these we have the filter update equations:
x̂_{k|k} = x̂_{k|k-1} + G_k (y_k – A x̂_{k|k-1})
P_{k|k} = P_{k|k-1} – G_k A P_{k|k-1},
where the filter gain is given by
G_k = P_{k|k-1} A^T (A P_{k|k-1} A^T + R_k)^-1.
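The sketch below applies the extrapolation and update equations over a sequence of measurements. The function name kalman_filter and its argument names are invented; applied to the simulated system from the previous sketch it should track the states noticeably better than the raw measurements do.

    import numpy as np

    def kalman_filter(measurements, F, A, Q, R, x0, P0):
        """Run the filter extrapolation and update equations over a measurement sequence."""
        x_hat, P = x0, P0
        estimates = []
        for y in measurements:
            # Extrapolation: x_{k|k-1} = F x_{k-1|k-1},  P_{k|k-1} = F P_{k-1|k-1} F^T + Q
            x_pred = F @ x_hat
            P_pred = F @ P @ F.T + Q
            # Gain: G_k = P_{k|k-1} A^T (A P_{k|k-1} A^T + R)^-1
            G = P_pred @ A.T @ np.linalg.inv(A @ P_pred @ A.T + R)
            # Update: x_{k|k} = x_{k|k-1} + G_k (y_k - A x_{k|k-1}),  P_{k|k} = P_{k|k-1} - G_k A P_{k|k-1}
            x_hat = x_pred + G @ (y - A @ x_pred)
            P = P_pred - G @ A @ P_pred
            estimates.append(x_hat)
        return np.array(estimates)

    # Applied to the simulated system from the previous sketch:
    # estimates = kalman_filter(measurements, F, A, Q, R, x0=np.zeros(2), P0=np.eye(2))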
The filter equations may be generalized by allowing the matrices A and F to also change with time, i.e., A becomes A_k and F becomes F_k. They may also be simplified by requiring the measurements or the state or both to be scalars. A system in which the noise variances are estimated from the data is called an adaptive filter. Ultimately, the Extended Kalman Filter allows non-linear functions F(x_k, u_k) and A(x_k, e_k) to be used.
SUMMARY
The following table is intended to summarize the estimates
developed here in a way that shows how each is either a generalization of, or a
restriction on, one of the previous estimates.
Ordinary least-squares      | x̂ = (A^T A)^-1 A^T y
Weighted least-squares      | x̂ = (A^T C A)^-1 A^T C y
Gauss-Markov                | x̂ = (A^T R^-1 A)^-1 A^T R^-1 y
Minimum variance            | x̂ = (A^T R^-1 A + Q^-1)^-1 A^T R^-1 y
Recursive minimum variance  | x̂_1 = x̂_0 + P_0 A^T (A P_0 A^T + R)^-1 (y – A x̂_0)
Kalman filter update        | x̂_{k|k} = x̂_{k|k-1} + P_{k|k-1} A^T (A P_{k|k-1} A^T + R_k)^-1 (y_k – A x̂_{k|k-1})
REFERENCES
1. Luenberger, D.G., Optimization by Vector Space Methods, John Wiley and Sons, 1969.
2. Strang, G., Introduction to Applied Mathematics, Wellesley-Cambridge Press, 1986.
3. Martinelli, R., Data Smoothing by Vector Space Methods, Haiku Laboratories, 2006.