by Rick Martinelli, Haiku Laboratories, June 2008

Copyright © 2008



Ordinary Least Squares

Weighted Least Squares

Minimum Variance Unbiased Estimate

Minimum Variance Estimates

Recursive Estimation

The Kalman Filter




The purpose of this paper is to develop the equations of the linear Kalman filter for use in data analysis.  Our primary interest is the smoothing and prediction of financial market data, and the Kalman filter is one of the most powerful tools for this purpose.  It is a recursive algorithm that predicts future values of an item from the information in all of its past values, and it is a least-squares procedure in that its prediction-error variance is minimized at each stage.  Development of the equations proceeds in steps, starting with ordinary least-squares estimation and moving through the Gauss-Markov estimate, minimum-variance estimation, and recursive estimation to, finally, the Kalman filter.  Much of this development is taken from references [1] and [2].



Ordinary Least Squares

We begin with the simplest linear model, y = Ax, where y is an n × 1 vector of measurements y(k), x is an m × 1 vector of unknown parameter values x(j), and A is a known n × m model matrix.  If n = m and A is invertible, the solution is given by x = A^-1 y.  For least-squares we assume the model is over-determined, i.e., there are more measurements than parameters to be determined, so that n > m.  Therefore A is not square and has no inverse.  Instead we seek the least-squares estimate x̂ of x as the minimizer of the length of the residual vector


e = y – Ax.


The squared length of e is the sum of the squares of the n separate residuals, and is denoted


|e|^2 = (y – Ax)^T (y – Ax),


where A^T is the transpose of A.  Minimizing this quadratic with respect to x yields the normal equations


A^T A x̂ = A^T y.


The matrix A^T A is symmetric and, provided A has full column rank, positive definite, so it has an inverse.  Hence we can write


x̂ = (A^T A)^-1 A^T y.


This is the ordinary least-squares (OLS) estimate of the parameter vector x.  If we define the m × n matrix L_O = (A^T A)^-1 A^T, then we have the simple expression


x̂ = L_O y.


From the model y = Ax, the OLS estimate of y is ŷ = Ax̂, that is,


ŷ = A L_O y = A (A^T A)^-1 A^T y = Sy.


This result has a nice geometric interpretation: the n × n matrix S is a projection matrix, and ŷ is the orthogonal projection of y onto the subspace generated by the columns of A.  Projections also yield very effective methods for smoothing data (see [3]).
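As a concrete check of these formulas, here is a short numerical sketch using numpy; the straight-line model and the data values are made up for illustration and are not from the text:

```python
import numpy as np

# Hypothetical over-determined model: fit a line y = x(1) + x(2)*t to 5 noisy points,
# so A has n = 5 rows and m = 2 columns.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
A = np.column_stack([np.ones_like(t), t])
y = np.array([0.1, 1.9, 4.2, 5.8, 8.1])

# OLS estimate x_hat = (A^T A)^-1 A^T y, obtained by solving the normal equations
x_hat = np.linalg.solve(A.T @ A, A.T @ y)

# Projection matrix S = A (A^T A)^-1 A^T; y_hat = S y is the projection of y
S = A @ np.linalg.inv(A.T @ A) @ A.T
y_hat = S @ y

# S is idempotent (S S = S) and symmetric, as any orthogonal projection is,
# and the residual y - y_hat is orthogonal to the columns of A.
assert np.allclose(S @ S, S)
assert np.allclose(A.T @ (y - y_hat), 0)
```

Solving the normal equations directly with `solve` is preferable to forming the inverse explicitly; the inverse is built here only to display S.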



Weighted Least Squares

We next consider the notion of weighted least-squares (WLS), where weights are assigned to the residuals according to the accuracy of the corresponding measurements.  For example, if y(1) is considered more accurate than y(2), we would like to make e(1) smaller than e(2).  Thus we replace the sum of squared residuals |e|^2 with a weighted sum


|We|^2 = (Wy – WAx)^T (Wy – WAx),


where W is an n × n matrix of weights.  If the measurements are independent, W is diagonal; if they are correlated, W has off-diagonal elements.  In all cases W is symmetric and positive definite.  For WLS the normal equations become


(WA)^T WA x̂ = (WA)^T W y,


and again we can take the inverse and write


x̂ = ((WA)^T WA)^-1 (WA)^T W y.


Since (WA)^T = A^T W^T, this is


x̂ = (A^T W^T W A)^-1 A^T W^T W y.


Letting C = W^T W, we have finally the WLS estimate L_W:


x̂ = L_W y = (A^T C A)^-1 A^T C y.


Note that OLS corresponds to a weight matrix W equal to the identity, that is, all measurement weights equal to one, so that C = I and L_W reduces to L_O.
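The WLS formula can be sketched numerically in the same way; the weights below are illustrative, down-weighting the last two (assumed noisier) measurements of the same made-up line-fit problem:

```python
import numpy as np

# Same toy line-fit as before, but suppose the last two measurements are
# considered less accurate, so they receive smaller weights.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
A = np.column_stack([np.ones_like(t), t])
y = np.array([0.1, 1.9, 4.2, 5.8, 8.1])
W = np.diag([1.0, 1.0, 1.0, 0.5, 0.5])   # diagonal: independent measurements

# C = W^T W, and the WLS estimate x_hat = (A^T C A)^-1 A^T C y
C = W.T @ W
x_wls = np.linalg.solve(A.T @ C @ A, A.T @ C @ y)

# With W = I the same formula reproduces the OLS estimate.
x_ols = np.linalg.solve(A.T @ A, A.T @ y)
```

The down-weighted points pull the fit less strongly, so `x_wls` differs (slightly, for this data) from `x_ols`.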



Minimum Variance Unbiased Estimate

To determine the optimum choice for the matrix C in the WLS estimate we need to add some statistical properties to the model.  We assume the measurement vector y, and hence the residual vector e = y – Ax, are random vectors.  One required statistical property is that the residuals are unbiased, that is, scattered evenly about zero.  This is expressed by E[e(k)] = 0 for each k, where E is the expected-value operator.


Each residual also has a variance, denoted V_k = E[e(k)^2], and any correlations between residuals appear as covariances V_km = V_mk = E[e(k)e(m)].  These are collected in the covariance matrix of e, defined by


R = E[ee^T].


The n × n matrix R has the variances V_k on its diagonal and the covariances V_km = V_mk as off-diagonal elements, and is positive-semidefinite.  Matrix R will be used to calculate C, after we define another required statistical property.


The second required property is that the linear estimator L is also unbiased, in the sense that the expected error in the estimate, x – x̂, is zero.  Although the true values of x are unknown, we may still write


0 = E[x – x̂] = E[x – Ly] = E[x – LAx – Le] = E[(I – LA)x].


Hence L is unbiased if LA = I, the m × m identity matrix.  Here we have used the fact that


E[Le] = LE[e] = 0,


since both E and L are linear.  The estimate errors also have a covariance matrix given by


P = E[(x – x̂)(x – x̂)^T].


P is m × m and positive-semidefinite.  Matrix P is called the fundamental matrix of filtering theory.  The minimization of the trace of this matrix provides the optimum choice for C mentioned above.  The trace of P may be written


E|x – x̂|^2 = E[(x – x̂)^T (x – x̂)].


It is this quantity we seek to minimize for minimum-variance estimation.  The solution to this optimization problem was found long ago to be C = R^-1, and the result is known as the Gauss-Markov theorem:


Given y = Ax + e, with E[ee^T] = R, the minimum-variance unbiased (MVU) estimator of x given y is


x̂ = L_G y = (A^T R^-1 A)^-1 A^T R^-1 y


with error covariance


P = (A^T R^-1 A)^-1.


It is easily seen that L_G A = I, showing that L_G is unbiased.  The general WLS estimator


L_W = (A^T C A)^-1 A^T C


is also unbiased, since again L_W A = I.  This includes OLS, where C = R = I, meaning that in OLS the residuals are tacitly assumed to be uncorrelated and to have variance 1.
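A quick numerical sketch of the Gauss-Markov estimator, with a randomly generated model matrix and a correlated residual covariance R (all values illustrative), confirms the unbiasedness condition L_G A = I:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 2))            # hypothetical 6 x 2 model matrix

# A correlated, positive-definite residual covariance: R = B B^T + 0.1 I
B = rng.normal(size=(6, 6))
R = B @ B.T + 0.1 * np.eye(6)

# Gauss-Markov: P = (A^T R^-1 A)^-1 and L_G = P A^T R^-1
Rinv = np.linalg.inv(R)
P = np.linalg.inv(A.T @ Rinv @ A)      # error covariance of the MVU estimate
L_G = P @ A.T @ Rinv

# Unbiasedness: L_G A equals the 2 x 2 identity
assert np.allclose(L_G @ A, np.eye(2))
```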



Minimum Variance Estimates

The previous estimates have assumed that x is a completely unknown parameter vector, capable of assuming any value.  In many situations, however, there is prior statistical information about x that may limit its possible values.  This a priori information may be exploited to produce an estimate, the minimum-variance (MV) estimate, with smaller error variance than the MVU estimate.


We adopt the model y = Ax + e, where now x, y and e are all random vectors and A is a known, constant matrix.  As before we assume E[ee^T] = R.  We now assume in addition that E[xx^T] = Q, where Q is positive-semidefinite, and that x and e are uncorrelated, i.e., E[ex^T] = 0, the zero matrix.  Under these conditions we have the Minimum-Variance Estimation theorem:


Given y = Ax + e, with E[xx^T] = Q, E[ee^T] = R and E[ex^T] = 0, the linear estimate that minimizes the error variance E|x – x̂|^2 is


x̂ = L_M y = (A^T R^-1 A + Q^-1)^-1 A^T R^-1 y


with error covariance


P = (A^T R^-1 A + Q^-1)^-1.


L_M is the linear minimum-variance estimator of x given y.  Notice that if Q^-1 = 0, corresponding to infinite variance of the a priori information, L_M reduces to L_G, the MVU estimator.
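The effect of the prior can be seen numerically: adding Q^-1 to the normal-equations matrix strictly shrinks the error covariance relative to the MVU estimate.  All matrices below are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 2))        # hypothetical model matrix
R = 0.5 * np.eye(5)                # residual covariance (independent, variance 0.5)
Q = 4.0 * np.eye(2)                # a priori covariance of x

Rinv, Qinv = np.linalg.inv(R), np.linalg.inv(Q)
P_mv  = np.linalg.inv(A.T @ Rinv @ A + Qinv)   # MV error covariance
P_mvu = np.linalg.inv(A.T @ Rinv @ A)          # MVU error covariance

# The two estimates for one hypothetical measurement vector y
y = rng.normal(size=5)
x_mv  = P_mv  @ A.T @ Rinv @ y
x_mvu = P_mvu @ A.T @ Rinv @ y

# The prior information reduces the error variance: trace(P_mv) < trace(P_mvu).
assert np.trace(P_mv) < np.trace(P_mvu)
```

As Q grows (the prior becomes uninformative), Qinv tends to 0 and the MV quantities approach the MVU ones, matching the remark above.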



Recursive Estimation

Suppose that an MV estimate x̂_0 of x has been found based on previous measurements and has error covariance P_0.  When new measurements become available, we want an update x̂_1 of x̂_0 based on the new information in the measurements.  Assume the new measurements are in the form y = Ax + e used above.  The best estimate of the new measurements y, based on the past measurements only, is ŷ = A x̂_0.  The error in this estimate, y – ŷ, is called the innovation.  It represents the part of the information in y that is orthogonal to A x̂_0, i.e., that could not be derived from x̂_0 by a linear operation.  The estimate x̂_1 is given by the Recursive Estimation theorem:


If x̂_0 is the MV estimate of x based on some previous measurements and has error covariance P_0, and if new measurements y become available, the updated MV estimate x̂_1 of x based on all the information is given by


x̂_1 = x̂_0 + P_0 A^T (A P_0 A^T + R)^-1 (y – A x̂_0)


with error covariance


P_1 = P_0 – P_0 A^T (A P_0 A^T + R)^-1 A P_0.


Notice that the old estimate x̂_0 is updated by the gain P_0 A^T (A P_0 A^T + R)^-1 acting on the innovation y – A x̂_0, and that the error covariance decreases with the addition of the new information in y.
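The theorem can be checked numerically: the recursive update from a first batch of measurements agrees with re-solving the Gauss-Markov problem on all the measurements stacked together.  The model matrices, noise covariances, and "true" parameter values below are all made up for the sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.array([1.0, -2.0])                      # hypothetical true parameters

# First batch of measurements -> MVU estimate x0_hat with covariance P0
A0 = rng.normal(size=(4, 2)); R0 = 0.2 * np.eye(4)
y0 = A0 @ x + rng.multivariate_normal(np.zeros(4), R0)
P0 = np.linalg.inv(A0.T @ np.linalg.inv(R0) @ A0)
x0_hat = P0 @ A0.T @ np.linalg.inv(R0) @ y0

# New measurements arrive
A1 = rng.normal(size=(3, 2)); R1 = 0.2 * np.eye(3)
y1 = A1 @ x + rng.multivariate_normal(np.zeros(3), R1)

# Recursive update: gain acting on the innovation y1 - A1 x0_hat
K = P0 @ A1.T @ np.linalg.inv(A1 @ P0 @ A1.T + R1)
x1_hat = x0_hat + K @ (y1 - A1 @ x0_hat)
P1 = P0 - K @ A1 @ P0

# Batch solution using all measurements at once gives the same answer
A_all = np.vstack([A0, A1]); y_all = np.concatenate([y0, y1])
R_all_inv = np.linalg.inv(np.block([[R0, np.zeros((4, 3))],
                                    [np.zeros((3, 4)), R1]]))
P_batch = np.linalg.inv(A_all.T @ R_all_inv @ A_all)
x_batch = P_batch @ A_all.T @ R_all_inv @ y_all
assert np.allclose(x1_hat, x_batch) and np.allclose(P1, P_batch)
```

The covariance also shrinks, trace(P1) < trace(P0), reflecting the information added by the new measurements.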



The Kalman Filter

The recursive MV estimate described above assumes the parameter vector x is random with a fixed but unknown mean value.  Many situations arise, however, in which the parameter vector changes with time, e.g., the position of a moving vehicle, or the condition of the stock market.  Thus we speak of the state of the parameter vector, or simply the state vector of the system, and indicate its value at time k by x_k.  In addition, we assume the state vector is updated by the state model, consisting of a linear update plus model noise:


x_{k+1} = F x_k + u_k


where F is a known m × m matrix and u_k is a random m × 1 vector with E[u_k] = 0 and E[u_k u_k^T] = Q_k for each k.  The measurement model becomes


y_k = A x_k + e_k


where we assume E[e_k] = 0 and E[e_k e_k^T] = R_k for each k, with R_k positive-definite.  The sequence e_k is called the measurement noise.  Taken together, these two equations and their noise statistics constitute a discrete dynamical system.


A discrete random process is a sequence of random variables.  The sequences y_k, e_k, x_k and u_k are examples, x_k and u_k being vector-valued.  There are three traditional estimation problems associated with random processes:


1. prediction – estimation of future values of the process from previous measurements

2. filtering - estimation of the current value of the process from previous inexact measurements up to the present

3. smoothing - estimation of past values of the process from previous inexact measurements up to the present


This article deals with the first two of these; for smoothing applications see [3].  The Kalman filter provides a linear, minimum-variance, recursive estimation procedure based on ideas in the previous sections.  The model for the Kalman filter is a discrete dynamical system 


x_k = F x_{k-1} + u_k

y_k = A x_k + e_k


where F and A are known matrices, E[e_k] = 0, E[e_k e_k^T] = R_k, E[u_k] = 0 and E[u_k u_k^T] = Q_k, for k = 0, 1, 2, ….  Assume also an initial estimate x̂_0 of x_0, with error covariance P_0 = E[(x_0 – x̂_0)(x_0 – x̂_0)^T].  The notation x̂_{k|k-1} will indicate the estimate of x_k from observations up to time k–1, and similarly for P_{k|k-1}.  The Kalman filter theorem is:


The linear MV estimate x̂_{k|k-1} of the state vector x_k based on the previous k–1 measurements is updated according to the filter extrapolation equations:


x̂_{k|k-1} = F x̂_{k-1|k-1}


P_{k|k-1} = F P_{k-1|k-1} F^T + Q_{k-1},


Initial conditions are x̂_{0|-1} = x̂_0 and P_{0|-1} = P_0.  In addition to these we have the filter update equations:


x̂_{k|k} = x̂_{k|k-1} + G_k (y_k – A x̂_{k|k-1})


P_{k|k} = P_{k|k-1} – G_k A P_{k|k-1},


where the filter gain is given by


G_k = P_{k|k-1} A^T (A P_{k|k-1} A^T + R_k)^-1.


The filter equations may be generalized by allowing the matrices A and F to change with time, i.e., A becomes A_k and F becomes F_k.  They may also be simplified by requiring the measurements, the state, or both to be scalars.  A system in which the noise variances are estimated from the data is called an adaptive filter.  Ultimately, the extended Kalman filter allows non-linear functions F(x_k, u_k) and A(x_k, e_k) to be used.
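The scalar simplification mentioned above makes for a compact numerical sketch of the full filter.  The example below tracks a simulated random-walk level from noisy readings, with F = A = 1 and assumed (made-up) noise variances q and r; it runs the extrapolation and update equations at each step:

```python
import numpy as np

# Simulate a scalar discrete dynamical system: x_k = x_{k-1} + u_k, y_k = x_k + e_k,
# with model-noise variance q and measurement-noise variance r (illustrative values).
rng = np.random.default_rng(3)
q, r = 0.01, 1.0
n_steps = 200

x = np.zeros(n_steps)
for k in range(1, n_steps):
    x[k] = x[k-1] + rng.normal(scale=np.sqrt(q))
y = x + rng.normal(scale=np.sqrt(r), size=n_steps)

# Kalman filter with F = A = 1
x_hat, P = 0.0, 1.0          # initial estimate and error covariance
estimates = []
for k in range(n_steps):
    # Extrapolation: x_{k|k-1} = F x_{k-1|k-1},  P_{k|k-1} = F P F^T + Q
    P = P + q
    # Update: gain G_k = P A^T (A P A^T + R)^-1, then state and covariance
    G = P / (P + r)
    x_hat = x_hat + G * (y[k] - x_hat)   # innovation y_k - A x_hat
    P = P - G * P
    estimates.append(x_hat)

estimates = np.array(estimates)
# The filtered estimate tracks x much more closely than the raw measurements.
print(np.mean((estimates - x) ** 2), np.mean((y - x) ** 2))
```

Note how the gain G balances the two noise sources: small q relative to r produces a small gain, so each noisy measurement nudges the estimate only slightly.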



The following table summarizes the estimates developed here, showing how each is either a generalization of, or a restriction on, one of the previous estimates.


Ordinary least-squares:       x̂ = (A^T A)^-1 A^T y

Weighted least-squares:       x̂ = (A^T C A)^-1 A^T C y

Minimum variance unbiased:    x̂ = (A^T R^-1 A)^-1 A^T R^-1 y

Minimum variance:             x̂ = (A^T R^-1 A + Q^-1)^-1 A^T R^-1 y

Recursive minimum variance:   x̂_1 = x̂_0 + P_0 A^T (A P_0 A^T + R)^-1 (y – A x̂_0)

Kalman filter update:         x̂_{k|k} = x̂_{k|k-1} + P_{k|k-1} A^T (A P_{k|k-1} A^T + R_k)^-1 (y_k – A x̂_{k|k-1})



1. Luenberger, D.G., Optimization by Vector Space Methods, John Wiley and Sons, 1969.

2. Strang, G., Introduction to Applied Mathematics, Wellesley-Cambridge Press, 1986.

3. Martinelli, R., Data Smoothing by Vector Space Methods, Haiku Laboratories, 2006.