Data-Smoothing By Vector Space Methods
by Rick Martinelli, Haiku
Laboratories June 2003
Copyright ©, 2003
Updated November 2006
Introduction
Vector Spaces and Least Squares
Projection Matrix
An
Example From The Stock Market
References
INTRODUCTION
Suppose a time-varying process x(t) is measured at
regular intervals, and it is known that the measurements are contaminated with
noise. If we let z(k) represent the measurement
at the kth interval, the situation may be represented by
(1) z(k) = x(k) + y(k) k = 1,2,...,N,
where {x(k)} is called the process, {z(k)} is
called the data, {y(k)} are samples from a zero-mean random sequence
with fixed variance σ2, and N is the number of
measurements. The data‑smoothing problem is to estimate the
process {x(k)} from the data {z(k)}. This is a centuries old problem, first
addressed by the likes of Gauss and Legendre who formulated the first
least-squares estimates.
Figure 1 shows a set of
simulated measurements (the data) as green circles, together with the underlying
process (blue line). The process is a uniform sampling of N=81 points of
the exponential function exp(‑t2). The noise sequence is machine-generated random
numbers from a uniform distribution with mean 0 and variance 1. The noise was treated to remove low frequency
components as described below.

Figure 1. Simulated data (green circles) and the underlying process. Data points are connected by dashed lines.
The problem is to retrieve a good approximation of the process from the noisy data. The usual approach is to impose some sort of smoothness condition on x, either by way of a process model, as in the case of least-squares estimates, or frequency restrictions as used in signal processing. More sophisticated approaches include a smoothness parameter that determines the trade-off between fidelity to a model and minimization of the error variance (see [1] for comparisons between approaches.).
The technique described here is a modified basis-expansion method (see [1]) which combines features from both the least-squares and signal processing approaches. Generic modeling functions are chosen so that the process contains a small number of the slowly varying (low frequency) components of the data, including the mean, and the noise contains all other components. Then the least-squares estimate of the process is calculated by applying a linear transformation, in the form of a projection matrix, to the data. This results in a very fast algorithm called a lowpass filter that may be used in real-time applications.

Figure 2. Smoothed estimates (circles)
and the underlying process.
Figure 2 shows the
results of this procedure applied to the data in Figure 1. Smoothed
estimates are shown as individual circles; the process is shown as a solid
line. The modeling functions used here are
(2)
{1,
Cos(πt), Cos(2πt), Cos(3πt), Cos(4πt)}
The noise is found by
subtracting the estimates from the data and is shown in Figure 3.

Figure 3.
Noise estimate: Data minus smoothed estimate.
In the next section we
provide an outline of the aspects of vector spaces and the least-squares method
needed to develop the projection matrix. In section 3, the projection
matrix is calculated and some of its properties described. In the last
section, the method is used to smooth real stock-market data to extract the
dominant frequency components. The vector space techniques used here may be
found in references [2] and [3]; there is nothing new here except possibly the
explicit calculation of the projection matrix from the modeling functions in
Theorems 1 and 2.
The following is an outline of the machinery required to calculate the projection matrix. An N-dimensional vector space H is a set of N-tuples
v = (v(1), v(2),…, v(N))
where the v(k) are real numbers called the coordinates of v. The zero vector in H is (0, 0,…, 0).
All vector spaces considered in this memo will be N-dimensional vector spaces for some fixed N. Such spaces are generalizations of the 2-dimensional plane where vectors are represented by arrows.
Example. Equation (1) in vector notation is
(3) z = x + y.
Let S = {v1, v2,…, vn} be a set of vectors in H. A linear combination of the vectors in S is a sum of the form
a1v1 + a2v2 +…+ anvn
where the ak are real constants. The set of all such
linear combinations is called the subspace generated by S and is denote
by [S].
Example. The set S of polynomials
S = {Vj(k) = (k/N) j , k =1,…,N}
where j=0,…,N-1 is a basis for H.
Example. The modeling functions in expression (2) above may be sampled to yield the set
S = {1, Cos(πkt/N), Cos(2πkt/N), Cos(3πkt/N), Cos(4πkt/M)}
which forms a basis for the 4-dimensional subspace [S] of H.
In particular, if the set S contains N linearly independent vectors, then S is a basis for H, and every vector in H is a linear combination of the basis vectors,
(3) v = a1v1 + a2v2 +…+ aNvN
for come constants ak .
Vector Norm: The notion of “vector length” can be generalized from the 2-dimensional case by defining the (squared) norm of a vector v = (v(1), v(2),…, v(N)) in H as
|v|2 = v(1)2 + v(2)2 + ,…,+ v(N)2
This is called the Euclidean norm, and H is called Euclidean N-space. A vector of norm one is called a unit vector.
Inner Product: The notion of dot product between vectors in two dimensions may be generalized to N dimensions by defining the inner product of two vectors v and w in H as
<v, w> = v(1)w(1) + v(2)w(2) +…+ v(N)w(N).
The inner product is symmetric: <v, w> = <w, v>. When N=2 it may be shown that this definition is equivalent to the standard dot product given by
|v||w| Cosine( β )
where β is the angle between the vectors. Thus, the inner product in H generalizes the notion of angle between vectors. With the inner product defined, H becomes a Hilbert space called Euclidean N-space. Note that the inner product <v, v> is the (squared) Euclidean norm of v.
Orthonormal Basis :
Two vectors are orthogonal in H, written v⊥w, whenever <v, w> = 0. Orthogonality
generalizes the 2-dimensional notion of perpendicularity. If two
vectors are orthogonal they are independent, but independent vectors may not be
orthogonal. An orthonormal (ON) basis for H is a set of N mutually
orthogonal unit vectors. Thus the set of vectors Ω = {u1, ..., uN} is an ON basis for H if and
only if <
|<v, w>| ≤ |v||w|.
(Note the double use of "|" as absolute value on the left, and vector norm on the right.)
Fourier Series: If v is a vector in H and Ω is an ON basis for H, then the orthonormal property of Ω implies equation (3) can be written
v = <v, u1> u1 + <v, u2> u2 + ... <v, uN> uN, or
v = F1u1 + F2u2 + ... + FNuN.
This is called the Fourier series representation of v and the inner products Fn = <v, un> are the Fourier coefficients of v relative to Ω.
Example: The Fourier series for z is
(4) z = <z, u1> u1 + <z, u2> u2 + ... + <z, uN> uN
Example: Assume that x is well-approximated by a linear combination of ON vectors {u1, ..., uM} from H, where M < N. Then
(5) x ≈ F1u1 + F2u2 + ... + FMuM
where Fj = <x, uj> are the (unknown) Fourier coefficients of the process x.
Orthogonal Projection: If Ω is an ON basis for H, and G is the subspace generated by a proper subset of Ω, say Ω’ = {u1, ..., uM} from H, where M < N, then
v* = <v, u1> u1 + <v, u2> u2 + ... <v, uM> uM
is called the orthogonal projection of v onto G. Since H uses the Euclidean norm, v* corresponds to the least-squares estimate of v using vectors in G. The subspace generated by the remaining members of Ω, namely {uM+1, ..., uN} is called the orthogonal complement of G and written GC. If v ∈ G and w ∈ GC, then clearly v ⊥ w. Any vector v in H can be written as the direct sum of two orthogonal vectors v = u + w, where u ∈ G and w ∈ GC.
Example: Equation (3) tacitly assumes that the noise y is orthogonal to x, and so z is a direct sum of x and y. Using approximation (5) we have
x = <z, u1> u1 + <z, u2> u2 + ... <z, uM> uM
y = <z, uM+1> uM+1 + <z, uM+2> uM+2 + ... + <z, uN> uN
Least Squares: Equation (4) and approximation (5) may be put in matrix form:
z ≈ Af + y,
where A is an N x M matrix with entries
akj = uj(k), k = 1,…,N, j=1,…,M
and f is an M-dimensional vector of the unknown parameters fj. We seek the estimate f* of f that minimizes
| z – Af |2
It can be shown that the solution f* is the least-squares estimate of f given by
f* = (A'A)-1 A'z
where the prime ' denotes matrix transpose (See [3], section 4.3). Hence the least-squares estimate of x given z is
x* = Pz
where
P = A(A'A)-1A'
is an N x N matrix. This shows that the least-squares estimate of x given z is a linear transformation acting on z, and which depends only on the vectors {uj}. It is easily seen that matrix P has the property P2 = P, making it a projection matrix. P is defined below as symmetric making it in fact an orthogonal projection. The expression for P contains a matrix inversion which is computationally costly. Using Fourier series expansions however, allows us to easily calculate P and prove it is the projection onto subspace G. This is done in the next section.
The following theorem calculates the projection matrix P.
Theorem 1. Let H be the Hilbert space of N-tuples of real numbers and let Ω = {u1, ..., uN} be an ON basis for H. Let G be the subspace of H generated by the subset Ω = {u1, ..., uM, M<N}. Let P be the symmetric, N x N matrix with components
(6)
Pk(m) =
uj(k) uj(m)
where k,m = 1,2,…,N and M < N. Then, for any vector v in H, Pv is the orthogonal projection of v onto G.
Proof: Write v as
v = <v, u1> u1 + <v, u2> u2 + ... + <v, uN> uN.
It must be shown that Pv is equal to
v* = <v, u1> u1 + <v, u2> u2 + ... + <v, uM> uM.
Choose a fixed row Pk of P (equivalently, a coordinate v*(k) of v*). Regarding Pk as a row vector and v as a column vector, matrix multiplication of Pk with v can be written as an inner product, and gives the k-th coordinate of v*:
<Pk, v> =
Pk(n) v(n) = ![]()
uj(k) uj(n) v(n)
=
uj(k)
uj(n) v(n) =
uj(k) <uj,v> = v*(k)
Since row k was arbitrary, this is true for all rows of P and so Pv = v*. ▌
Gram-Schmidt: In view of Theorem 1 it is essential to have an ON basis for H that also serves as process model. Fortunately, for every set of N independent vectors there is and equivalent set that is orthonormal. Let {v1, v2,…, vN} be a set of N independent vectors in H. Then a new set {u1, u2, ,…, uN} may be generated that is an ON basis for H by following the Gram-Schmidt procedure. First, normalize v1 to obtain u1:
u1 = v1 / |v1|.
Clearly, u1 and v1 generate the same subspace. Next, u2 is obtained from v2 by subtracting any contribution to v2 by u1, and then normalizing the result again:
w2 = v2 - <v2, u1> u1
u2 = w2 / |w2|.
From <w2, u1> = 0 , it follows that u1 ⊥ u2. Furthermore u1 and u1 generate the same subspace as v1 and v2. Continuing by induction
wn = vn -
<vn,
and
un = wn / |wn|.
for each n ≤ N. By the
same reasoning, the sets {u1, ..., uM}
and {v1, v2,…, vM} generate the same subspace
for each choice of M ≤ N, and
Example: To smooth the data in Figure 1, the Gram-Schmidt procedure was applied to modeling functions
{Cos((n-1)πt), n=1,2,…,81}
These functions were first sampled to produce the set of vectors
(7) {vn(k) = Cos((n-1)πk/81), n=1,2,…,81},
where k = 0, 1, …, 80, then
orthonormalized to get the basis vectors {

Figure 4. Graphs of the first six orthonormal basis vectors generated by the basis vectors in Equation (7).
Data-Smoothing Example: The process vector x shown in Figure 1 was calculated as 81 points of a uniform sampling of exp(-t2) on the interval [-3.3,3.3]. A noise vector was calculated as y = (I – P)w, where I is the identity matrix on H, and w is a vector of machine-generated random numbers from a uniform distribution with mean 0 and variance 1. This insures that x and y are computationally orthogonal. The resulting vector y had mean .01534 and variance 1.01884. The simulated data z was obtained by adding corresponding coordinates of x and y. To smooth the data the first 5 of these basis vectors in Figure 4 were used to calculate P in accordance with equation (6). The result from applying P to the data in Figure 1 is the smoothed estimate Pz represented by the circles in Figure 2. The error is less than .012 over the interval, with the greatest deviations near the end points. Large end-point deviations are typical (see Figure 6).
Row Norms: For each row Pk of P, the computation of xk as <Pk, z> is a weighted average of the coordinates of z. Figure 5 is a plot of five of these rows, for the P matrix used to obtain the smoothed estimates in Figure 2. An animation of all 81 functions would show a traveling wave, beginning high on the left, near 0.4, dropping quickly as it moves to the right with a peak height around 0.2, and finally building up high at the right end. The symmetry of these curves is due to the fact that P itself is symmetric by definition and that P2 = P.

Figure 5. Rows 1, 21, 41, 61, 81 of the P matrix used to obtain the estimates in Figure 2.
The rows Pk of P, regarded as vectors in H, enjoy another property, namely:
Theorem 2: The square roots of the diagonal entries in P are the norms of their respective rows, i.e., |Pk|2 = Pk(k).
Proof:
|Pk|2 =<Pk, Pk> =
Pk(j) Pk (j) =
Pk(j)
un(k)
un(j)
=
un(k)
Pk(j) un(j) =
un(k)
Pk(n) = Pk(k) ▌.
Bias-Variance Tradeoff: The squared-norms |Pk|2 are important because they represent an upper bound on the smoothing capability of their respective rows. Define the estimation error due to P by
x* - x = Pz – x = Px + Py – x = b + Py,
where the bias vector is defined by
b = Px – x,
and Py is the residuals vector. The standard bias-variance decomposition of the squared-error is
|x* - x|2 = |b + Py|2 = |b|2 + |Py|2 + 2|<b, Py>|.
The quantity |Py|2 is called residual variance due to P; it is the amount of noise remaining in the estimate. The quantity |b|2 is called the bias due to P and indicates any lack of fit by the modeling functions. The last term represents any correlation between the bias and residual variance, and is usually assumed to be zero. Notice that, if y is truly orthogonal to G then residual variance is zero, and if x is truly a member of G then P is unbiased and b = 0.
There is a bias-variance tradeoff inherent in this and similar procedures: as model complexity increases, the bias decreases while the variance increases (see reference [1] for more on this). In the current procedure, model complexity is indicated by the dimension of the subspace G. More modeling functions in G means less bias due to P at the cost of increased residual variance.
Noise-Ratio: Define a vector RG, called the noise-ratio due to G, coordinate-wise by
RG(k) = (x(k)* - x(k))2/ |y|2.
RG(k) represents the fraction of noise left in the estimate by the kth row of P. Its dependence on the number, the length and nature of the basis vectors is indicated by the subscript G. Using the bias-variance decomposition coordinate-wise we have
(x(k)* - x(k))2 = (b(k) + <Pk,y>)2 = b(k)2 + <Pk,y>2 ≤ b(k)2 + |Pk|2 |y|2
where Schwartz was used to obtain the inequality, and the residuals and bias vector are assumed uncorrelated. Thus, in the unbiased case, the squared-norms |Pk|2 are upper bounds for the noise-ratios:
RG(k) ≤ |Pk|2
for each k.

Figure 6. Upper bounds for noise-ratios vs. row number for rows of P generated by the basis vectors in Equation (7).
Figure 6 shows plots of |Pk|2 versus k for the basis vectors in equation (7), and subspace dimensions 1 through 6. The first curve, labeled Dim 1, corresponds to a subspace G of dimension 1 generated by u1. The second curve, labeled Dim 2, corresponds to a G of dimension 2 generated by u1 and u2, with similar correspondences for the other curves. Notice that, for fixed k, |Pk|2 increases with the dimension of G, and that, except in the Dim 1 case, is largest near the edges. These curves look the same for different values of N, only the scale changes, uniformly decreasing as N increases. At the midpoint, the value for Dim 1 equals that for Dim2; similarly the value for Dim 3 equals that for Dim4, and so on for higher dimensions. Figure 7 shows Upper bounds for midpoint noise-ratios vs. number of data points for the basis vectors in equation (7) and for subspaces G of odd dimension 1 through 9.

Figure 7. Upper bounds for midpoint noise-ratios vs. number of data points, for subspaces of dimension 1, 3, 5 ,7 and 9 generated by the basis vectors in Equation (7).
Fourier Coefficients: Figures 1 and 2 provide visual evidence for the ability of the projection matrix to accurately extract the underlying process from simulated noisy data. Another way to see this effect is to look at the Fourier coefficients of x, y and z. Figure 8 shows the Fourier coefficients Fn for the data in Figure 1, the estimate in Figure 2 and the noise in Figure 3, relative to the basis vectors in equation (7). Numbers on the x-axis are values of n called the coefficient number. The effectiveness of this procedure can be seen in the clean separation of the data coefficients into the process and the noise coefficients. The process retains the first five coefficients, corresponding to the five ON basis vectors used in the process model.

Figure 8a. Fourier coefficients of the data in Figure 1

Figure 8b. Fourier coefficients of the process estimate in Figure 2

Figure 8c. Fourier coefficients of the noise estimate in Figure 3
Subspace Dimension: We have so far avoided the question of which dimension to choose. In the simulated case above, we have the truth, as seen in Figure 1, so can judge that any dimension less than 5 produces a biased estimate. The situation is different in real-world applications, where the truth is never known. Selecting the dimension size is equivalent to choosing kernel length in kernel smoothing ([1], Chapter 6), the size of the multiplier of the penalty term in penalized fit methods [([1], Chapter 7), and the frequency cutoff in low-pass digital filtering. A low-pass filter is a linear device that removes frequencies above the cutoff frequency, which is exactly what the matrix P did as seen in Figures 8. Each coefficient Fn corresponds to a single, cyclic data-component of fixed period L. If n > 1, then the period of the corresponding component is given by
L = 2N/(n-1)
where N is the number of points. For example, the coefficient at n = 3 corresponds to a component with a period of 40.5 data points. In this memo our choice of subspace dimension is guided by the Fourier coefficients, as seen in the following example.
AN EXAMPLE FROM THE STOCK MARKET
Closing prices for
stocks and other market items are reported at the end of each trading day,
along with the number of shares traded, known as the item’s volume. The
product of these two numbers, called price-volume, is an

Figure 9. Price-volume (x10-8) for Microsoft ending 10/06/06.
Some market analysts search for cyclic patterns in this type of data to help determine market direction. Figure 10 shows Fourier coefficients for the data, which has been adjusted to have mean zero. Removing the mean is equivalent to setting the first coefficient to zero, which emphasizes the other values.

Figure 10. Fourier coefficients for the data in Figure 9 after mean deletion.
There are dominant values
at n=3, corresponding to a slowly varying component, and at n=8, the largest
value. In addition there are some “natural gaps”, where coefficients have
small values,

Figure 11. Data and smoothed estimates using basis vectors in equation (7), and dimensions 4, 9, 12 and 25.
The dominant component at n=8 can be isolated by subtraction and is shown in Figure 12. Its period is seen to be about 23 days, which corresponds to the time between the major peaks in Figure 9. The analyst can use any or all of the smoothings or isolated components for predicting market direction.

Figure 12. Data component corresponding to coefficient n=8.
1. Hastie,
T, Tibshirani, R, and Friedman, J, The Elements of Statistical Learning,
Springer, 2001.
2. Halmos, P, Finite
Dimensional Vector Spaces, Van Nostrand, 1993.
3. Luenberger, D.G., Optimization
by Vector Space Methods, Wiley, 1969.