Section 3.4 Linear Regression and Lines of Best Fit
Subsection 3.4.1 Overview
In the activities and homework for Section 3.3, you might have found a different equation than other members of your group or class. Even if your equations had similar slopes or \(y\)-intercepts, it might be disconcerting to use different equations to model the same set of data.
Subsection 3.4.2 Lines of Best Fit
Statisticians worked on the problem of finding lines of best fit over many years. Initially, the idea was to find a line that had as many points above it as below it, so that the directed distances of the points from the line summed to 0. Even though the process worked fairly well, the same problem occurred as in our work with eyeballed lines of best fit: different people could arrive at different lines for the same data. Scientists were not satisfied with estimates that worked most of the time. They wanted precision and a method that was mathematically justifiable. Carl Friedrich Gauss developed the method of least squares to solve problems in astronomy.
To understand the method of least squares, consider an example. In Figure 3.4.2.1, three points and an eyeballed line of best fit are indicated. Each point is connected to the line with a segment the length of which is a residual, the directed distance of the point from the eyeballed line of best fit; a residual is positive if the point lies above the line and negative if the point lies below the line. Notice that the sum of the residuals is 0. In Figure 3.4.2.2, each residual is the same as in Figure 3.4.2.1 but this time, a square with side length equal to the residual is drawn. In the method of least squares, the squares of the residuals are found; these are the areas of the squares. The one line that best fits all of the points is the line for which the sum of the areas of the squares is least.
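The comparison described above can be sketched in a few lines of Python. This is only an illustration: the three points and the two candidate lines are hypothetical (the coordinates in Figure 3.4.2.1 are not listed in the text), and the function names are chosen for this sketch. Both candidate lines have residuals that sum to 0, but only one of them has the least sum of squared residuals.

```python
# Hypothetical points; the points in Figure 3.4.2.1 are not given in the text.
points = [(1, 2), (2, 5), (3, 5)]

def residuals(slope, intercept, pts):
    """Directed vertical distances: positive above the line, negative below."""
    return [y - (slope * x + intercept) for x, y in pts]

def sum_of_squares(slope, intercept, pts):
    """Total area of the squares drawn on the residuals."""
    return sum(r ** 2 for r in residuals(slope, intercept, pts))

def least_squares_line(pts):
    """Slope and intercept of the line with the least sum of squared residuals."""
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Two candidate lines (assumed for illustration); both have residuals summing to 0.
line_a = (1.5, 1.0)   # y = 1.5x + 1
line_b = (0.0, 4.0)   # y = 4

for name, (m, b) in [("A", line_a), ("B", line_b)]:
    print(name, sum(residuals(m, b, points)), sum_of_squares(m, b, points))

print(least_squares_line(points))
```

For these made-up points, the sum of the residuals alone cannot distinguish the two lines, while the sum of the squares can. This is the reason the method compares areas rather than directed distances.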
Student Page 3.4.3 The Wave
Let's try another experiment. Think about the Wave as it happens at a football game. Someone starts it and gets most of the people in the stadium standing up and sitting down as the wave approaches then leaves them. We'll practice the wave in the same way, keeping track of how long it takes to complete the wave based on the number of participants in the wave.
Use Table 3.4.3.1 to record the amount of time it takes to complete a wave for the given number of participants.
Number of Participants | Time to Complete the Wave |
---|---|
5 | |
10 | |
15 | |
20 | |
25 | |
30 | |
35 | |
1.
Plot the data.
2.
Describe the graph.
3.
Eyeball a line of best fit.
4.
Find the equation of the line you just found.
5.
What does the slope represent in this context?
6.
What does the \(y\)-intercept represent in this context?
Activity 7.
As you might have guessed, there are electronic tools to find regression equations and lines of best fit. Both Desmos and graphing calculators have tools that will determine lines of best fit using the method of least squares. The graphing calculator directions are included at the end of the homework in this lesson. Desmos includes directions in the online graphing calculator. To find the Regression directions on Desmos, click on the “?” button in the upper right corner of the screen. At the top of the screen that appears, there is a list of Tours. Choose “Regressions”. Follow the directions using the data set Desmos provides. Use Desmos' directions to enter and analyze additional data sets.
In this lesson, use Desmos or a graphing calculator to find a regression equation to fit the Wave data. How closely does your eyeballed line of best fit approximate the regression line? How long would it take to complete the wave if 100 people participated? How many people would have to participate to keep the wave going for 10 minutes?
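The last two questions use a regression equation in opposite directions: one substitutes a value of the independent variable, and the other solves for it. A minimal Python sketch of both directions follows. The slope `a` and intercept `b` are placeholder values assumed for illustration, not results from any data; replace them with the values your tool reports. The sketch also assumes times were recorded in seconds, so 10 minutes must be converted before solving.

```python
# Hypothetical regression line: time = a * participants + b, time in seconds.
# These coefficients are placeholders, not values fitted to real Wave data.
a = 0.25   # seconds per additional participant (assumed)
b = 1.0    # seconds (assumed)

def time_for(participants):
    """Predicted time in seconds for a given number of participants."""
    return a * participants + b

def participants_for(seconds):
    """Number of participants the line predicts for a given time in seconds."""
    return (seconds - b) / a

# Substitute a value of the independent variable.
print(time_for(100))

# Solve for the independent variable; convert 10 minutes to seconds first.
print(participants_for(10 * 60))
```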
Homework 3.4.4 Homework
1.
Each space in the Monopoly board game is numbered with GO being space 0. The price of the property by space number is provided.
(a)
Plot the data.
(b)
Use the regression capability of an electronic tool to find a regression equation to fit the data in Table 3.4.4.1. Record the equation.
(c)
What is the slope? What does the slope represent in this context?
(d)
What is the \(y\)-intercept? What does the \(y\)-intercept represent in this context?
(e)
Compare the slope and \(y\)-intercept from your regression line with those of the eyeballed line of best fit (see Exercise 3.3.4.2). How close was your eyeballed line to the line you found electronically? Would your eyeballed line be a reasonable approximation for the regression line? Why or why not?
(f)
Does the regression line agree with the price of the property on space 18 in Table 3.4.4.1?
Space Number | Price of Property | Space Number | Price of Property |
---|---|---|---|
1 | 60 | 21 | 220 |
3 | 60 | 23 | 220 |
6 | 100 | 24 | 240 |
8 | 100 | 26 | 260 |
9 | 120 | 27 | 260 |
11 | 140 | 29 | 280 |
13 | 140 | 31 | 300 |
14 | 160 | 32 | 300 |
16 | 180 | 34 | 320 |
18 | 180 | 37 | 350 |
19 | 200 | 39 | 400 |
(g)
What space number would have a property costing $500 if the board were expanded to include more spaces?
2.
In the table below, several Monopoly properties have been deleted from Table 3.4.4.1. Only the third property in each group of three same-colored properties remains. Space numbers and prices are provided.
Space Number | Price of Property |
---|---|
9 | 120 |
14 | 160 |
19 | 200 |
24 | 240 |
29 | 280 |
34 | 320 |
(a)
What is the slope? What does the slope mean in this context?
(b)
What is the \(y\)-intercept? What does the \(y\)-intercept mean in this context?
(c)
Determine an equation to fit this new table.
(d)
How does the equation fit the data in this table? Explain.
(e)
Draw the graph of this line using another color on the scatterplot you created in Exercise 3.4.4.1. What do you notice?
(f)
If every space had a price associated with it, how much would space 25 cost?
(g)
Which space number would cost $500?
3.
Revisit the problems below.
Find regression equations to fit each data set.
Compare the regression equation with the equation you found earlier. Comment on the accuracy of your previous work when compared with the regression equation.
Write and answer a meaningful question in context that requires you to solve the regression equation using your choice of a value of the dependent variable.
Write and answer a meaningful question in context that requires you to solve the regression equation using your choice of a value of the independent variable.
(a)
Packaging Stacked Cups (Exercise 3.2.3.1 and Exercise 3.2.3.2): Choose one of the class data sets. Find a regression equation to fit the average height of the cups depending on the number of cups stacked.
(b)
Gulliver Graphs (Section 3.3): Choose one of the data sets, (Thumb, Wrist), (Wrist, Neck), or (Thumb, Neck). Find three regression equations, one for each group, female, male, and entire group.
(c)
U.S. shoe sizes (Exercise 3.3.4.1): Online, find a Men's Shoe Conversion Chart. Compare men's shoe sizes to women's shoe sizes for the same foot lengths. Use foot length in either inches or centimeters as the independent variable and shoe sizes as the dependent variable. Provide a printed list of the data.
4.
World records for human accomplishments in sports over time provide interesting data to analyze. Locate a sport of interest for which world record progressions can be found. For example, the men's world records for the mile run are available online for 1865 to 1999.
(a)
Provide a printed list of the data.
(b)
Using an electronic tool, plot the data with year being the independent variable and world record time being the dependent variable.
(c)
Find a regression equation to fit the data. Graph the equation on the same axes as the scatterplot of the data.
(d)
Interpret the slope and the \(y\)-intercept in terms of the years and world record times.
(e)
What is a reasonable domain for the sport you are investigating? Why do you think so?
(f)
What is the range corresponding to your chosen domain?
(g)
Write and answer a meaningful question in context that requires you to solve the regression equation using your choice of a value of the independent variable.
(h)
Write and answer a meaningful question in context that requires you to solve the regression equation using your choice of a value of the dependent variable.
(i)
Write a paragraph discussing your findings and including your opinion about whether or not a linear regression equation will be a good predictor for future world record winning times and years.
Student Page 3.4.5 Entering and Graphing Data and Finding Regression Equations
1.
To Enter the Data:
(a)
Press STAT then choose 1: Edit…
(b)
Enter the values for the independent variable (domain) into L1.
(c)
Enter the values for the dependent variable (range) into L2.
2.
To Graph the Data:
(a)
Press STAT PLOT (2ND Y=), then choose 1: Plot1… and press ENTER.
(b)
Settings: On, Scatter plot (first graph), Xlist: L1, Ylist: L2, Mark: +
(c)
Press GRAPH
3.
To Set the Viewing Window:
(a)
Press WINDOW
(b)
Use the table to determine Xmin, Xmax, Xscl, Ymin, Ymax, and Yscl.
(c)
OR Press ZOOM then scroll down to 9: ZoomStat, Press ENTER
4.
To Fit a Function to the Data:
(a)
Press STAT
(b)
Move the cursor to the right to highlight CALC
(c)
Scroll down to the function type you want to use, for example, 4: LinReg(ax+b) (to fit a linear function), then Press ENTER.
(d)
Indicate the lists of the data (separated by commas for older TI-84 calculators, L1, L2).
(e)
Choose where you want to place the function equation, for example, Y1. Press VARS, move the cursor to the right to highlight Y-VARS, press ENTER to choose 1: Function, scroll to highlight the Y-variable you want, then press ENTER again.
(f)
The screen should read:
LinReg(ax+b)
Xlist: L1
Ylist: L2
FreqList:
Store RegEQ: Y1
Calculate
(For older TI-84s, the screen will read 4: LinReg(ax+b) L1, L2, Y1). Press ENTER (to fit a linear function).
(g)
The regression equation will be in Y1 or whatever Y-variable you chose.
(h)
Press GRAPH to plot the regression equation with the data.