
Section 26 Least Squares Approximations

Subsection Application: Fitting Functions to Data

Data is all around us. Data is collected on almost everything, and it is important to be able to use data to make predictions. However, data is rarely well-behaved, so we generally need approximation techniques to make estimates from the data. One technique for this is the method of least squares approximation. As we will see, we can use linear algebra to fit a variety of different types of curves to data.

Subsection Introduction

In this section our focus is on fitting linear and polynomial functions to data sets.

Preview Activity 26.1.

NBC was awarded the U.S. television broadcast rights to the 2016 and 2020 summer Olympic games. Table 26.1 lists the amounts paid (in millions of dollars) by NBC sports for the 2008 through 2012 summer Olympics plus the recently concluded bidding for the 2016 and 2020 Olympics, where year 0 is the year 2008. (We will assume a simple model here, ignoring variations such as the value of money due to inflation, viewership data which might affect NBC's expected revenue, etc.) Figure 26.2 shows a plot of the data. Our goal in this activity is to find a linear function f defined by f(x) = a_0 + a_1x that fits the data well.

Table 26.1. Olympics television broadcast rights.
Year Amount
0 894
4 1180
8 1226
12 1418
Figure 26.2. A plot of the data.

If the data were actually linear, then the data would satisfy the system

\begin{align*}
a_0 + 0a_1 &= 894\\
a_0 + 4a_1 &= 1180\\
a_0 + 8a_1 &= 1226\\
a_0 + 12a_1 &= 1418.
\end{align*}

The vector form of this system is

\[
a_0 [1 \; 1 \; 1 \; 1]^T + a_1 [0 \; 4 \; 8 \; 12]^T = [894 \; 1180 \; 1226 \; 1418]^T.
\]

This equation does not have a solution, so we seek the best approximation to a solution that we can find. That is, we want to find a_0 and a_1 so that the line f(x) = a_0 + a_1x provides a best fit to the data.

Letting v_1 = [1 \; 1 \; 1 \; 1]^T, v_2 = [0 \; 4 \; 8 \; 12]^T, and b = [894 \; 1180 \; 1226 \; 1418]^T, our vector equation becomes

\[
a_0 v_1 + a_1 v_2 = b.
\]

To make a best fit, we will minimize the square of the distance between b and a vector of the form a_0 v_1 + a_1 v_2. That is, we minimize

\[
\|b - (a_0 v_1 + a_1 v_2)\|^2. \tag{26.1}
\]

Rephrasing this in terms of projections, we are looking for the vector in W = Span{v_1, v_2} that is closest to b. In other words, the values of a_0 and a_1 will occur as the weights when we write proj_W b as a linear combination of v_1 and v_2. The one wrinkle in this problem is that we need an orthogonal basis for W to find this projection. Use appropriate technology throughout this activity.
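If you want to check your computations with software, the following is a minimal numpy sketch of the calculation this activity outlines: Gram-Schmidt applied to {v_1, v_2}, followed by the projection of b onto W. It is offered as a check, not as the activity's intended solution, and all variable names are ours.

```python
import numpy as np

# Data from Table 26.1.
v1 = np.array([1.0, 1.0, 1.0, 1.0])
v2 = np.array([0.0, 4.0, 8.0, 12.0])
b = np.array([894.0, 1180.0, 1226.0, 1418.0])

# Gram-Schmidt: an orthogonal basis {w1, w2} for W = Span{v1, v2}.
w1 = v1
w2 = v2 - (v2 @ w1) / (w1 @ w1) * w1

# Project b onto W using the orthogonal basis.
y = (b @ w1) / (w1 @ w1) * w1 + (b @ w2) / (w2 @ w2) * w2

# Recover a0 and a1 by writing y as a linear combination of v1 and v2;
# this system is consistent because y lies in W.
a0, a1 = np.linalg.lstsq(np.column_stack([v1, v2]), y, rcond=None)[0]
print(a0, a1)
```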

(a)

Find an orthogonal basis B = {w_1, w_2} for W.

(b)

Use the basis B to find y = proj_W b as illustrated in Figure 26.3.

Figure 26.3. Projecting b onto W.
(c)

Find the values of a_0 and a_1 that give our best fit line by writing y as a linear combination of v_1 and v_2.

(d)

Draw a picture of your line from the previous part on the axes with the data set. How well do you think your line approximates the data? Explain.

Subsection Least Squares Approximations

In Section 25 we saw that the projection of a vector v in R^n onto a subspace W of R^n is the best approximation to v of all the vectors in W. In fact, if v = [v_1 \; v_2 \; \cdots \; v_n]^T and proj_W v = [w_1 \; w_2 \; \cdots \; w_n]^T, then the error in approximating v by proj_W v is given by

\[
\|v - \operatorname{proj}_W v\|^2 = \sum_{i=1}^{n} (v_i - w_i)^2.
\]

In the context of Preview Activity 26.1, we projected the vector b onto the span of the vectors v_1 = [1 \; 1 \; 1 \; 1]^T and v_2 = [0 \; 4 \; 8 \; 12]^T. The projection minimizes the distance between the vectors in W and the vector b (as shown in Figure 26.3), and it also produces the line that minimizes the sum of the squares of the vertical distances from the line to the data points, as illustrated in Figure 26.4 with the Olympics data. This is why these approximations are called least squares approximations.

Figure 26.4. Least squares linear approximation

While we can always solve least squares problems using projections, we can often avoid having to create an orthogonal basis when fitting functions to data. We work in a more general setting, showing how to fit a polynomial of degree n to a set of data points. Our goal is to fit a polynomial p(x) = a_0 + a_1x + a_2x^2 + \cdots + a_nx^n of degree n to m data points (x_1,y_1), (x_2,y_2), \ldots, (x_m,y_m), no two of which have the same x-coordinate. In the unlikely event that the polynomial p(x) actually passes through the m points, we would have the m equations

\begin{align*}
y_1 &= a_0 + a_1x_1 + a_2x_1^2 + \cdots + a_{n-1}x_1^{n-1} + a_nx_1^n \tag{26.2}\\
y_2 &= a_0 + a_1x_2 + a_2x_2^2 + \cdots + a_{n-1}x_2^{n-1} + a_nx_2^n \tag{26.3}\\
y_3 &= a_0 + a_1x_3 + a_2x_3^2 + \cdots + a_{n-1}x_3^{n-1} + a_nx_3^n \tag{26.4}\\
&\vdots \tag{26.5}\\
y_m &= a_0 + a_1x_m + a_2x_m^2 + \cdots + a_{n-1}x_m^{n-1} + a_nx_m^n \tag{26.6}
\end{align*}

in the n+1 unknowns a_0, a_1, \ldots, a_{n-1}, and a_n.

The m data points are known in this situation and the coefficients a_0, a_1, \ldots, a_n are the unknowns. To write the system in matrix-vector form, the coefficient matrix M is

\[
M = \begin{bmatrix}
1 & x_1 & x_1^2 & \cdots & x_1^{n-1} & x_1^n\\
1 & x_2 & x_2^2 & \cdots & x_2^{n-1} & x_2^n\\
1 & x_3 & x_3^2 & \cdots & x_3^{n-1} & x_3^n\\
\vdots & \vdots & \vdots & & \vdots & \vdots\\
1 & x_m & x_m^2 & \cdots & x_m^{n-1} & x_m^n
\end{bmatrix},
\]

while the vectors a and y are

\[
a = \begin{bmatrix} a_0\\ a_1\\ a_2\\ \vdots\\ a_{n-1}\\ a_n \end{bmatrix}
\qquad \text{and} \qquad
y = \begin{bmatrix} y_1\\ y_2\\ y_3\\ \vdots\\ y_{m-1}\\ y_m \end{bmatrix}.
\]

Letting y = [y_1 \; y_2 \; \cdots \; y_m]^T and v_i = [x_1^{i-1} \; x_2^{i-1} \; \cdots \; x_m^{i-1}]^T for 1 \le i \le n+1, the vector form of the system is

\[
y = a_0 v_1 + a_1 v_2 + \cdots + a_n v_{n+1}.
\]
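In software, the coefficient matrix M (a Vandermonde matrix) can be assembled directly. Here is a minimal numpy sketch; the helper name poly_design_matrix is ours, not from the text.

```python
import numpy as np

def poly_design_matrix(x, n):
    """Return M with columns 1, x, x^2, ..., x^n evaluated at the data points x."""
    return np.vander(x, N=n + 1, increasing=True)

x = np.array([0.0, 4.0, 8.0, 12.0])   # the Olympics years from Table 26.1
M = poly_design_matrix(x, 1)          # degree 1 (linear) model: rows [1, x_i]
print(M)
```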

Of course, it is unlikely that the m data points already lie on a polynomial of degree n, so the system will usually have no solution. Instead of attempting to find coefficients a_0, a_1, \ldots, a_n that give an exact solution, which may be impossible, we look for a vector that is "close" to a solution. As we have seen, the vector proj_W y, where W is the span of the columns of M, minimizes the sum of the squares of the differences of the components. That is, our desired approximation to a solution to Mx = y is the projection of y onto Col M. Now proj_W y is a linear combination of the columns of M, so proj_W y = Ma for some vector a. This vector a then minimizes ||y - proj_W y|| = ||y - Ma||. That is, if we let (Ma)^T = [b_1 \; b_2 \; b_3 \; \cdots \; b_m], we are minimizing

\[
\|y - Ma\|^2 = (y_1 - b_1)^2 + (y_2 - b_2)^2 + \cdots + (y_m - b_m)^2. \tag{26.7}
\]

The expression ||y - Ma||^2 measures the error in our approximation.

The question we want to answer is how to find the vector a that minimizes ||y - Ma|| in a way that is more convenient than computing a projection. We answer this question in a general setting in the next activity.

Activity 26.2.

Let A be an m×n matrix and let b be in R^m. Let W = Col A. Then proj_W b is in Col A, so let x̂ be in R^n such that Ax̂ = proj_W b.

(a)

Explain why b - Ax̂ is orthogonal to every vector of the form Ax for any x in R^n. That is, b - Ax̂ is orthogonal to Col A.

(b)

Let a_i be the ith column of A. Explain why a_i \cdot (b - Ax̂) = 0. From this, explain why A^T(b - Ax̂) = 0.

(c)

From the previous part, show that x̂ satisfies the equation

\[
A^T A \hat{x} = A^T b.
\]

The result of Activity 26.2 is that we can now do least squares polynomial approximations using just matrix operations. We summarize this in the following theorem.

Theorem 26.5.

Let A be an m×n matrix and let b be a vector in R^m. A vector x̂ in R^n is a least squares solution to Ax = b if and only if x̂ is a solution to the normal equations

\[
A^T A \hat{x} = A^T b. \tag{26.8}
\]

The equations in the system (26.8) are called the normal equations for Ax = b. To illustrate with the Olympics data, our data points are (0,894), (4,1180), (8,1226), and (12,1418), with y = [894 \; 1180 \; 1226 \; 1418]^T. So

\[
M = \begin{bmatrix} 1 & 0\\ 1 & 4\\ 1 & 8\\ 1 & 12 \end{bmatrix}.
\]

Notice that M^T M is invertible. To find the degree 1 approximation to the data, technology shows that

\[
a = (M^T M)^{-1} M^T y = \begin{bmatrix} 4684/5 \\ 809/20 \end{bmatrix} = \begin{bmatrix} 936.8 \\ 40.45 \end{bmatrix},
\]

just as in Preview Activity 26.1.
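As a quick check, the normal equations for the Olympics data can be solved in a few lines of numpy. We use np.linalg.solve on M^T M rather than forming the inverse explicitly, a standard numerical habit; the result is the same vector a as above.

```python
import numpy as np

M = np.array([[1.0, 0.0],
              [1.0, 4.0],
              [1.0, 8.0],
              [1.0, 12.0]])
y = np.array([894.0, 1180.0, 1226.0, 1418.0])

# Solve the normal equations M^T M a = M^T y.
a = np.linalg.solve(M.T @ M, M.T @ y)
print(a)   # [936.8, 40.45], i.e. f(x) = 936.8 + 40.45x
```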

Activity 26.3.

Now use the least squares method to find the best polynomial approximations (in the least squares sense) of degrees 2 and 3 for the Olympics data set in Table 26.1. Which polynomial seems to give the “best” fit? Explain why. Include a discussion of the errors in your approximations. Use your “best” least squares approximation to estimate how much NBC might pay for the television rights to the 2024 Olympic games. Use technology as appropriate.

The solution with our Olympics data gave us the situation where M^T M was invertible. This corresponded to a unique least squares solution (M^T M)^{-1} M^T y. It is reasonable to ask when this happens in general. To conclude this section, we will demonstrate that if the columns of a matrix A are linearly independent, then A^T A is invertible.

Activity 26.4.

Let A be an m×n matrix with linearly independent columns.

(a)

What must be the relationship between m and n? Explain.

(b)

We know that an n×n matrix M is invertible if and only if the only solution to the homogeneous system Mx = 0 is x = 0. Note that A^T A is an n×n matrix. Suppose that A^T A x = 0 for some x in R^n.

(i)

Show that ||Ax||=0.

Hint.

What is x^T(A^T A x)?

(ii)

What does ||Ax||=0 tell us about x in relation to Nul A? Why?

(iii)

What is Nul A? Why? What does this tell us about x, and then about A^T A?

We summarize the result of Activity 26.4 in the following theorem.

Theorem 26.6.

Let A be an m×n matrix. If the columns of A are linearly independent, then the matrix A^T A is invertible.

If the columns of A are linearly dependent, we can still solve the normal equations, but we will obtain more than one solution. In a later section we will see that we can also use a pseudoinverse in these situations.
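For a concrete sense of the dependent-column case, here is a small numpy sketch (our own illustrative example, not from the text). The normal equations are singular, yet least squares solutions still exist; np.linalg.lstsq and the pseudoinverse both return the minimum-norm one.

```python
import numpy as np

# The second column is twice the first, so A^T A is not invertible.
A = np.array([[1.0, 2.0],
              [1.0, 2.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

x, residuals, rank, svals = np.linalg.lstsq(A, b, rcond=None)
print(x)                       # a least squares solution (minimum norm)
print(np.linalg.pinv(A) @ b)   # the pseudoinverse gives the same vector
```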

Subsection Examples

What follows are worked examples that use the concepts from this section.

Example 26.7.

According to the Centers for Disease Control and Prevention [45], the average length of a male infant (in centimeters) in the US as it ages (with age in months from 1.5 to 8.5) is given in Table 26.8.

Table 26.8. Average lengths of male infants
Age (months) 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5
Average Length (cm) 56.6 59.6 62.1 64.2 66.1 67.9 69.5 70.9

In this problem we will find the line and the quadratic of best fit in the least squares sense to this data. We treat age in months as the independent variable and length in centimeters as the dependent variable.

(a)

Find a line that is the best fit to the data in the least squares sense. Draw a picture of your least squares solution against a scatterplot of the data.

Solution.

We assume that a line of the form f(x) = a_1x + a_0 contains all of the data points. The first data point would satisfy 1.5a_1 + a_0 = 56.6, the second 2.5a_1 + a_0 = 59.6, and so on, giving us the linear system

\begin{align*}
1.5a_1 + a_0 &= 56.6\\
2.5a_1 + a_0 &= 59.6\\
3.5a_1 + a_0 &= 62.1\\
4.5a_1 + a_0 &= 64.2\\
5.5a_1 + a_0 &= 66.1\\
6.5a_1 + a_0 &= 67.9\\
7.5a_1 + a_0 &= 69.5\\
8.5a_1 + a_0 &= 70.9.
\end{align*}

Letting

\[
A = \begin{bmatrix}
1.5 & 1\\ 2.5 & 1\\ 3.5 & 1\\ 4.5 & 1\\ 5.5 & 1\\ 6.5 & 1\\ 7.5 & 1\\ 8.5 & 1
\end{bmatrix}, \qquad
x = \begin{bmatrix} a_1\\ a_0 \end{bmatrix}, \qquad \text{and} \qquad
b = \begin{bmatrix} 56.6\\ 59.6\\ 62.1\\ 64.2\\ 66.1\\ 67.9\\ 69.5\\ 70.9 \end{bmatrix},
\]

we can write this system in the matrix form Ax=b. Neither column of A is a multiple of the other, so the columns of A are linearly independent. The least squares solution x to the system is then found by

\[
x = (A^T A)^{-1} A^T b.
\]

Technology shows that (with entries rounded to 3 decimal places) (A^T A)^{-1} A^T is

\[
\begin{bmatrix}
-0.083 & -0.060 & -0.036 & -0.012 & 0.012 & 0.036 & 0.060 & 0.083\\
0.542 & 0.423 & 0.304 & 0.185 & 0.065 & -0.054 & -0.173 & -0.292
\end{bmatrix},
\]

and

\[
x \approx \begin{bmatrix} 2.011 \\ 54.559 \end{bmatrix}.
\]

So the least squares linear function for the data is f(x) ≈ 2.011x + 54.559. A graph of f against the data points is shown at left in Figure 26.9.

Figure 26.9. Left: Least squares line. Right: Least squares quadratic.
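To double-check part (a) numerically, here is a hedged numpy version of the same computation (variable names ours):

```python
import numpy as np

age = np.array([1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5])
length = np.array([56.6, 59.6, 62.1, 64.2, 66.1, 67.9, 69.5, 70.9])

# Columns x and 1, matching the model f(x) = a1*x + a0.
A = np.column_stack([age, np.ones_like(age)])
x_hat = np.linalg.solve(A.T @ A, A.T @ length)
print(x_hat)   # approximately [2.011, 54.559]
```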

(b)

Now find the least squares quadratic of the form q(x)=a2x2+a1x+a0 to the data. Draw a picture of your least squares solution against a scatterplot of the data.

Solution.

The first data point would satisfy (1.5^2)a_2 + 1.5a_1 + a_0 = 56.6, the second (2.5^2)a_2 + 2.5a_1 + a_0 = 59.6, and so on, giving us the linear system

\begin{align*}
1.5^2a_2 + 1.5a_1 + a_0 &= 56.6\\
2.5^2a_2 + 2.5a_1 + a_0 &= 59.6\\
3.5^2a_2 + 3.5a_1 + a_0 &= 62.1\\
4.5^2a_2 + 4.5a_1 + a_0 &= 64.2\\
5.5^2a_2 + 5.5a_1 + a_0 &= 66.1\\
6.5^2a_2 + 6.5a_1 + a_0 &= 67.9\\
7.5^2a_2 + 7.5a_1 + a_0 &= 69.5\\
8.5^2a_2 + 8.5a_1 + a_0 &= 70.9.
\end{align*}

Letting

\[
A = \begin{bmatrix}
1.5^2 & 1.5 & 1\\ 2.5^2 & 2.5 & 1\\ 3.5^2 & 3.5 & 1\\ 4.5^2 & 4.5 & 1\\ 5.5^2 & 5.5 & 1\\ 6.5^2 & 6.5 & 1\\ 7.5^2 & 7.5 & 1\\ 8.5^2 & 8.5 & 1
\end{bmatrix}, \qquad
x = \begin{bmatrix} a_2\\ a_1\\ a_0 \end{bmatrix}, \qquad \text{and} \qquad
b = \begin{bmatrix} 56.6\\ 59.6\\ 62.1\\ 64.2\\ 66.1\\ 67.9\\ 69.5\\ 70.9 \end{bmatrix},
\]

we can write this system in the matrix form Ax=b. Technology shows that every column of the reduced row echelon form of A contains a pivot, so the columns of A are linearly independent. The least squares solution x to the system is then found by

\[
x = (A^T A)^{-1} A^T b.
\]

Technology shows that (with entries rounded to 3 decimal places) (A^T A)^{-1} A^T is

\[
\begin{bmatrix}
0.042 & 0.006 & -0.018 & -0.030 & -0.030 & -0.018 & 0.006 & 0.042\\
-0.500 & -0.119 & 0.143 & 0.286 & 0.310 & 0.214 & 0.000 & -0.333\\
1.365 & 0.540 & -0.049 & -0.403 & -0.522 & -0.406 & -0.055 & 0.531
\end{bmatrix},
\]

and

\[
x \approx \begin{bmatrix} -0.118 \\ 3.195 \\ 52.219 \end{bmatrix}.
\]

So the least squares quadratic function for the data is q defined by q(x) ≈ -0.118x^2 + 3.195x + 52.219. A graph of q against the data points is shown at right in Figure 26.9.
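The quadratic fit can be checked the same way; only the coefficient matrix changes (a sketch, names ours):

```python
import numpy as np

age = np.array([1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5])
length = np.array([56.6, 59.6, 62.1, 64.2, 66.1, 67.9, 69.5, 70.9])

# Columns x^2, x, 1, matching q(x) = a2*x^2 + a1*x + a0.
A = np.column_stack([age**2, age, np.ones_like(age)])
x_hat = np.linalg.solve(A.T @ A, A.T @ length)
print(x_hat)   # approximately [-0.118, 3.195, 52.219]
```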

Example 26.10.

Least squares solutions can be found through a QR factorization, as we explore in this example. Let A be an m×n matrix with linearly independent columns and QR factorization A=QR. Suppose that b is not in Col A so that the system Ax=b is inconsistent. We know that the least squares solution x to Ax=b is

\[
x = (A^T A)^{-1} A^T b. \tag{26.9}
\]
(a)

Replace A by its QR factorization in (26.9) to show that

\[
x = R^{-1} Q^T b.
\]
Hint.

Use the fact that Q is orthogonal and R is invertible.

Solution.

Replacing A with QR, and using the facts that R is invertible and that Q^T Q = I (the columns of Q are orthonormal), we see that

\begin{align*}
x &= (A^T A)^{-1} A^T b\\
&= ((QR)^T (QR))^{-1} (QR)^T b\\
&= (R^T Q^T Q R)^{-1} R^T Q^T b\\
&= (R^T R)^{-1} R^T Q^T b\\
&= R^{-1} (R^T)^{-1} R^T Q^T b\\
&= R^{-1} Q^T b.
\end{align*}

So if A = QR is a QR factorization of A, then the least squares solution to Ax = b is R^{-1} Q^T b.

(b)

Consider the data set in Table 26.11, which shows the average life expectancy in years in the US for selected years from 1950 to 2010.

Table 26.11. Life expectancy in the US
year 1950 1965 1980 1995 2010
age 68.14 70.21 73.70 75.98 78.49
(Data from macrotrends [46].)
(i)

Use (26.9) to find the least squares linear fit to the data set.

Solution.

A linear fit to the data will be provided by the least squares solution to Ax=b, where

\[
A = \begin{bmatrix} 1 & 1950\\ 1 & 1965\\ 1 & 1980\\ 1 & 1995\\ 1 & 2010 \end{bmatrix}, \qquad
x = \begin{bmatrix} a\\ b \end{bmatrix}, \qquad \text{and} \qquad
b = \begin{bmatrix} 68.14\\ 70.21\\ 73.70\\ 75.98\\ 78.49 \end{bmatrix}.
\]

Technology shows that

\[
(A^T A)^{-1} A^T b \approx [-276.1000 \;\; 0.1765]^T.
\]
(ii)

Use appropriate technology to find the QR factorization of an appropriate matrix A, and use the QR decomposition to find the least squares linear fit to the data. Compare to what you found in part i.

Solution.

Technology shows that A=QR, where

\[
Q = \begin{bmatrix}
\frac{1}{\sqrt{5}} & -\frac{2}{\sqrt{10}}\\
\frac{1}{\sqrt{5}} & -\frac{1}{\sqrt{10}}\\
\frac{1}{\sqrt{5}} & 0\\
\frac{1}{\sqrt{5}} & \frac{1}{\sqrt{10}}\\
\frac{1}{\sqrt{5}} & \frac{2}{\sqrt{10}}
\end{bmatrix}
\qquad \text{and} \qquad
R = \begin{bmatrix} \sqrt{5} & 1980\sqrt{5}\\ 0 & 15\sqrt{10} \end{bmatrix}.
\]

Then we have that

\[
R^{-1} Q^T b \approx [-276.1000 \;\; 0.1765]^T,
\]

just as in part i.
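A numpy check of this example follows (a sketch; note that np.linalg.qr may return Q and R with some signs flipped relative to the factorization above, which changes neither the product QR nor the least squares solution):

```python
import numpy as np

A = np.column_stack([np.ones(5),
                     np.array([1950.0, 1965.0, 1980.0, 1995.0, 2010.0])])
b = np.array([68.14, 70.21, 73.70, 75.98, 78.49])

# Least squares via the normal equations.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Least squares via the (reduced) QR factorization: x = R^{-1} Q^T b.
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

print(x_normal)   # approximately [-276.1, 0.1765]
print(x_qr)       # the same, up to roundoff
```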

Subsection Summary

  • A least squares approximation to Ax=b is found by orthogonally projecting b onto Col A.

  • If the columns of A are linearly independent, then the least squares solution to Ax = b is x = (A^T A)^{-1} A^T b.

  • The least squares solution x to Ax = b, where Ax = [y_1 \; y_2 \; \cdots \; y_m]^T and b = [b_1 \; b_2 \; \cdots \; b_m]^T, minimizes the distance ||Ax - b||, where

    ||Ax - b||^2 = (y_1 - b_1)^2 + (y_2 - b_2)^2 + ... + (y_m - b_m)^2.

    So the least squares solution minimizes a sum of squares.

Exercises

1.

The University of Denver Infant Study Center investigated whether babies take longer to learn to crawl in cold months, when they are often bundled in clothes that restrict their movement, than in warmer months. The study sought a relationship between babies' first crawling age and the average temperature during the month they first try to crawl (about 6 months after birth). Some of the data from the study is in Table 26.12. Let x represent the temperature in degrees Fahrenheit and C(x) the average crawling age in months.

Table 26.12. Crawling age
x 33 37 48 57
C(x) 33.83 33.35 33.38 32.32
(a)

Find the least squares line to fit this data. Plot the data and your line on the same set of axes. (We aren't concerned about whether a linear fit is really a good choice outside of this data set; we just fit a line to it to see what happens.)

(b)

Use your least squares line to predict the average crawling age when the temperature is 65 degrees Fahrenheit.

2.

The cost, in cents, of a first class postage stamp in years from 1981 to 1995 is shown in Table 26.13.

Table 26.13. Cost of postage
Year 1981 1985 1988 1991 1995
Cost 20 22 25 29 32
(a)

Find the least squares line to fit this data. Plot the data and your line on the same set of axes.

(b)

Now find the least squares quadratic approximation to this data. Plot the quadratic function on the same axes as your linear function.

(c)

Use your least squares line and quadratic to predict the cost of a postage stamp in the current year. Look up the cost of a stamp today and determine how accurate your prediction is. Which function gives a better approximation? Provide reasons for any discrepancies.

3.

According to The Song of Insects by G.W. Pierce (Harvard College Press, 1948) the sound of striped ground crickets chirping, in number of chirps per second, is related to the temperature. So the number of chirps per second could be a predictor of temperature. The data Pierce collected is shown in the table and scatterplot below, where x is the (average) number of chirps per second and y is the temperature in degrees Fahrenheit.

x y
20.0 88.6
16.0 71.6
19.8 93.3
18.4 84.3
17.1 80.6
15.5 75.2
14.7 69.7
17.1 82.0
15.4 69.4
16.2 83.3
15.0 79.6
17.2 82.6
16.0 80.6
17.0 83.5
14.4 76.3

The relationship between x and y is not exactly linear, but looks to have a linear pattern. It could be that the relationship is really linear but experimental error causes the data to be slightly inaccurate. Or perhaps the data is not linear, but only approximately linear. Find the least squares linear approximation to the data.

4.

We showed that if the columns of A are linearly independent, then A^T A is invertible. Show that the reverse implication is also true. That is, show that if A^T A is invertible, then the columns of A are linearly independent.

5.

Consider the small data set of points S={(2,1),(2,2),(2,3)}.

(a)

Find a linear system Ax=b whose solution would define a least squares linear approximation to the data in set S.

(b)

Explain what happens when we attempt to find the least squares solution x using the matrix (A^T A)^{-1} A^T. Why does this happen?

(c)

Does the system Ax=b have a least squares solution? If so, how many and what are they? If not, why not?

(d)

Fit a linear function of the form x=a0+a1y to the data. Why should you have expected the answer?

6.

Let M and N be any matrices such that MN is defined. In this exercise we investigate relationships between ranks of various matrices.

(a)

Show that Col MN is a subspace of Col M. Use this result to explain why rank(MN) ≤ rank(M).

(b)

Show that rank(M^T M) = rank(M) = rank(M^T).

Hint.

For this part, see Exercise 12 in Section 15.

(c)

Show that rank(MN) ≤ min{rank(M), rank(N)}.

7.

We have seen that if the columns of a matrix M are linearly independent, then

\[
a = (M^T M)^{-1} M^T y
\]

is a least squares solution to Ma = y. What if the columns of M are linearly dependent? From Activity 26.2, a least squares solution to Ma = y is a solution to the equation (M^T M)a = M^T y. In this exercise we demonstrate that (M^T M)a = M^T y always has a solution.

(a)

Explain why it is enough to show that the rank of the augmented matrix [M^T M | M^T y] is the same as the rank of M^T M.

(b)

Explain why rank(M^T M) = rank(M^T).

Hint.

Use Exercise 6.

(c)

Explain why [M^T M | M^T y] = M^T [M | y].

Hint.

Use the definition of the matrix product.

(d)

Explain why rank(M^T [M | y]) ≤ rank(M^T).

Hint.

Use part (a) of Exercise 6.

(e)

Finally, explain why rank([M^T M | M^T y]) = rank(M^T M).

Hint.

Combine parts (b) and (d).

8.

If A is an m×n matrix with linearly independent columns, the least squares solution x = (A^T A)^{-1} A^T b to Ax = b has the property that Ax = A(A^T A)^{-1} A^T b is the vector in Col A that is closest to b. That is, A(A^T A)^{-1} A^T b is the projection of b onto Col A. The matrix P = A(A^T A)^{-1} A^T is called a projection matrix. Projection matrices have special properties.
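Before working the parts below, it may help to see these properties numerically; here is a small sketch of our own (a random matrix has linearly independent columns with probability 1):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 2))

# P = A (A^T A)^{-1} A^T projects onto Col A.
P = A @ np.linalg.inv(A.T @ A) @ A.T

print(np.allclose(P @ P, P))   # P^2 = P
print(np.allclose(P, P.T))     # P = P^T
```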

(a)

Show that P^2 = P = P^T.

(b)

In general, we define projection matrices as follows.

Definition 26.14.

A square matrix E is a projection matrix if E^2 = E.

Show that

\[
E = \begin{bmatrix} 0 & 1\\ 0 & 1 \end{bmatrix}
\]

is a projection matrix. Onto what does E project?

(c)

Notice that the projection matrix from part (b) is not an orthogonal matrix.

Definition 26.15.

A square matrix E is an orthogonal projection matrix if E^2 = E = E^T.

Show that

\[
E = \begin{bmatrix} \frac{1}{2} & \frac{1}{2}\\ \frac{1}{2} & \frac{1}{2} \end{bmatrix}
\]

is an orthogonal projection matrix. Onto what does E project?

(d)

If E is an n×n orthogonal projection matrix, show that if b is in R^n, then b - Eb is orthogonal to every vector in Col E. (Hence, E projects orthogonally onto Col E.)

(e)

Recall that the projection proj_v u of a vector u in the direction of a vector v is given by

\[
\operatorname{proj}_v u = \frac{u \cdot v}{v \cdot v} v.
\]

Show that proj_v u = Eu, where E is the orthogonal projection matrix

\[
E = \frac{1}{v \cdot v} v v^T.
\]

Illustrate with v = [1 \; 1]^T.

9.

Label each of the following statements as True or False. Provide justification for your response.

(a) True/False.

If the columns of A are linearly independent, then there is a unique least squares solution to Ax=b.

(b) True/False.

Let {v_1, v_2, \ldots, v_n} be an orthonormal basis for R^n. The least squares solution to [v_1 \; v_2 \; \cdots \; v_{n-1}]x = v_n is x = 0.

(c) True/False.

The least squares line to the data points (0,0), (1,0), and (0,1) is y=x.

(d) True/False.

If the columns of a matrix A are linearly dependent and the vector b is not in Col A, then there is no least squares solution to Ax = b.

(e) True/False.

Every matrix equation of the form Ax=b has a least squares solution.

(f) True/False.

If the columns of A are linearly independent, then the least squares solution to Ax=b is the orthogonal projection of b onto Col A.

Subsection Project: Other Least Squares Approximations

In this section we learned how to fit a polynomial function to a set of data in the least squares sense. But data takes on many forms, so it is important to be able to fit other types of functions to data sets. We investigate three different types of regression problems in this project.

Project Activity 26.5.

The length of a species of fish is to be represented as a function of the age and water temperature, as shown in the table below [47]. The fish are kept in tanks at 25, 27, 29, and 31 degrees Celsius. After birth, a test specimen is chosen at random every 14 days and its length measured. The data include:

  • I, the index;

  • x, the age of the fish in days;

  • y, the water temperature in degrees Celsius;

  • z, the length of the fish.

Since there are three variables in the data, we cannot perform a simple linear regression. Instead, we seek a model of the form

f(x,y)=ax+by+c

to fit the data, where f(x,y) approximates the length. That is, we seek the best fit plane to the data. This is an example of what is called multiple linear regression. A scatterplot of the data, along with the best fit plane, is also shown.
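As a sketch of the setup this activity asks you to work out (treat it as a hint rather than the solution), the model matrix has one row [x_i, y_i, 1] per data point. Here we use only a few rows of the data table, so the printed coefficients are not the final answer for the full data set.

```python
import numpy as np

# Six rows of the data table: age x, temperature y, length z.
x = np.array([14.0, 28.0, 41.0, 14.0, 28.0, 41.0])
y = np.array([25.0, 25.0, 25.0, 27.0, 27.0, 27.0])
z = np.array([620.0, 1315.0, 2120.0, 625.0, 1215.0, 2110.0])

# Rows [x_i, y_i, 1] for the model f(x, y) = a*x + b*y + c.
M = np.column_stack([x, y, np.ones_like(x)])
a, b, c = np.linalg.lstsq(M, z, rcond=None)[0]
print(a, b, c)
```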

(a)

As we did when we fit polynomials to data, we start by considering what would happen if all of our data points satisfied our model function. In this case our data points have the form (x_1,y_1,z_1), (x_2,y_2,z_2), \ldots, (x_m,y_m,z_m). Explain what system of linear equations would result if the data points actually satisfied our model function f(x,y) = ax + by + c. (You don't need to write 44 different equations; just explain the general form of the system.)

(b)

Write the system from (a) in the form Ma=z, and specifically identify the matrix M and the vectors a and z.

(c)

The same derivation as with the polynomial regression models shows that the vector a that minimizes ||z - Ma|| is found by

\[
a = (M^T M)^{-1} M^T z.
\]

Use this to find the least squares fit of the form f(x,y)=ax+by+c to the data.

(d)

Provide a numeric measure of how well this model function fits the data. Explain.

Index Age Temp (C) Length
1 14 25 620
2 28 25 1315
3 41 25 2120
4 55 25 2600
5 69 25 3110
6 83 25 3535
7 97 25 3935
8 111 25 4465
9 125 25 4530
10 139 25 4570
11 153 25 4600
12 14 27 625
13 28 27 1215
14 41 27 2110
15 55 27 2805
16 69 27 3255
17 83 27 4015
18 97 27 4315
19 111 27 4495
20 125 27 4535
21 139 27 4600
22 153 27 4600
23 14 29 590
24 28 29 1305
25 41 29 2140
26 55 29 2890
27 69 29 3920
28 83 29 3920
29 97 29 4515
30 111 29 4520
31 125 29 4525
32 139 29 4565
33 153 29 4566
34 14 31 590
35 28 31 1205
36 41 31 1915
37 55 31 2140
38 69 31 2710
39 83 31 3020
40 97 31 3030
41 111 31 3040
42 125 31 3180
43 139 31 3257
44 153 31 3214

Project Activity 26.6.

Population growth is typically not well modeled by polynomial functions. Populations tend to grow at rates proportional to the population, which implies exponential growth. For example, Table 26.16 shows the approximate population of the United States in years between 1920 and 2000, with the population measured in millions.

Table 26.16. U.S. population
Year 1920 1930 1940 1950 1960 1970 1980 1990 2000
Population 106 123 142 161 189 213 237 259 291

If we assume the population grows exponentially, we would want to find the best fit function f of the form f(t) = ae^{kt}, where a and k are constants. However, an exponential function is not linear. So to apply the methods we have developed, we can instead apply the natural logarithm to both sides of y = ae^{kt} to obtain the equation ln(y) = ln(a) + kt. We can then find the best fit line to the data in the form (t, ln(y)) to determine the values of ln(a) and k. Use this approach to find the best fit exponential function in the least squares sense to the U.S. population data. Then look up the U.S. population in 2010 (include your source) and compare to the estimate given by your model function. If your prediction is not very close, give some plausible explanations for the difference.
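A hedged numpy sketch of this log-transform approach follows; it is exactly the linear least squares fit from this section, applied to the points (t, ln(y)).

```python
import numpy as np

t = np.array([1920.0, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000])
pop = np.array([106.0, 123, 142, 161, 189, 213, 237, 259, 291])   # millions

# Fit ln(y) = ln(a) + k*t by least squares, then recover a.
M = np.column_stack([np.ones_like(t), t])
ln_a, k = np.linalg.lstsq(M, np.log(pop), rcond=None)[0]
a = np.exp(ln_a)

print(a, k)                      # parameters of f(t) = a e^{kt}
print(a * np.exp(k * 2010.0))    # the model's 2010 prediction, in millions
```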

Figure 26.17. Best fit ellipse.

Project Activity 26.7.

Carl Friedrich Gauss is often credited with inventing the method of least squares. He used the method to find a best-fit ellipse which allowed him to correctly predict the orbit of the asteroid Ceres as it passed behind the sun in 1801. (Adrien-Marie Legendre appears to be the first to publish the method, though.) Here we examine the problem of fitting an ellipse to data.

An ellipse can be described by a quadratic equation of the form

\[
x^2 + By^2 + Cxy + Dx + Ey + F = 0 \tag{26.10}
\]

for constants B, C, D, E, and F, with B>0. We will find the best-fit ellipse in the least squares sense through the points

(0,2), (2,1), (1,1), (1,2), (3,1),  and  (1,1).

A picture of the best fit ellipse is shown in Figure 26.17.

(a)

Find the system of linear equations that would result if the ellipse (26.10) were to exactly pass through the given points.

(b)

Write the linear system from part (a) in the form Ax=b, where the vector x contains the unknowns in the system. Clearly identify A, x, and b.

(c)

Find the least squares ellipse for this set of points. Make sure your method is clear. (Note that we are really fitting a surface of the form f(x,y) = x^2 + By^2 + Cxy + Dx + Ey + F to a set of data points in the xy-plane, so the error is the sum of the squares of the vertical distances from the points in the xy-plane to the surface.)
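One way to organize the computation in part (c) is sketched below, taking the six points as printed above; moving x^2 to the right-hand side makes each equation linear in the unknowns B, C, D, E, and F.

```python
import numpy as np

# The six data points (x, y) as printed in the activity.
pts = np.array([[0.0, 2.0], [2.0, 1.0], [1.0, 1.0],
                [1.0, 2.0], [3.0, 1.0], [1.0, 1.0]])
x, y = pts[:, 0], pts[:, 1]

# Each point gives B*y^2 + C*x*y + D*x + E*y + F = -x^2.
A = np.column_stack([y**2, x * y, x, y, np.ones_like(x)])
b = -x**2

coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
print(coeffs)   # B, C, D, E, F
```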

45. cdc.gov/growthcharts/html_charts/lenageinf.htm
46. macrotrends.net/countries/USA/united-states/life-expectancy
47. Mathematical Algorithms for Linear Regression, Helmut Spaeth, Academic Press, 1991, page 305, ISBN 0-12-656460-4.