By the end of this section, you should be able to give precise and thorough answers to the questions listed below. You may want to keep these questions in mind to focus your thoughts as you complete the section.
What is a finite dimensional vector space?
What is the dimension of a finite dimensional vector space? What important result about bases of finite dimensional vector spaces makes dimension well-defined?
What must be true about any linearly independent subset of $n$ vectors in a vector space with dimension $n$? Why?
What must be true about any subset of $n$ vectors in a vector space with dimension $n$ that spans the vector space? Why?
The discipline of statistics is based on the idea of analyzing data. In a large data set one usually wants to understand the relationships between the different variables in the set, which can be difficult when the data set is too large to examine visually for patterns. Principal Component Analysis (PCA) is a tool for identifying and representing underlying patterns in large data sets, and it has been called one of the most important and valuable linear algebra tools for statistical analysis. PCA is used to transform a collection of variables into a (usually smaller) number of uncorrelated variables called principal components. The principal components form the most meaningful basis from which to view data, removing extraneous information and revealing underlying relationships in the data. PCA thus provides a framework for reducing a complex data set to a lower dimension while retaining its important attributes. The output helps the experimenter determine which dynamics in the data are important and which can be ignored.
In Section 15 we learned that any two bases for a subspace of $\mathbb{R}^n$ contain the same number of vectors. This allowed us to define the dimension of a subspace of $\mathbb{R}^n$. In this section we extend the arguments we made in Section 15 to arbitrary vector spaces and define the dimension of a vector space.
The main tool we used to prove that any two bases for a subspace of $\mathbb{R}^n$ must contain the same number of elements was Theorem 15.1. In this preview activity we will show that the same argument can be used for vector spaces. More specifically, we will prove a special case of the following theorem generalizing Theorem 15.1.
If $V$ is a vector space with a basis of $n$ vectors, then any subset of $V$ containing more than $n$ vectors is linearly dependent.
Suppose $V$ is a vector space with basis $\mathcal{B}$. Consider the given set of vectors in $V$. We will show that this set is linearly dependent using an approach similar to that of Preview Activity 15.1.
Since $\mathcal{B}$ is a basis of $V$, it spans $V$. Using this information, rewrite the given vectors in terms of the basis vectors in $\mathcal{B}$ and substitute into the above equation to obtain another equation in terms of the vectors in $\mathcal{B}$.
Since $\mathcal{B}$ is a basis of $V$, the vectors in $\mathcal{B}$ are linearly independent. Using the equation in the previous part, determine what this means about the coefficients.
Express the conditions on the coefficients in the form of a matrix-vector equation. Explain why there are infinitely many solutions for the weights and why this means the given vectors are linearly dependent.
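The following is a sketch of how this argument establishes the general theorem; the symbols $\mathbf{v}_i$, $\mathbf{u}_j$, $a_{ij}$, and $x_j$ are generic placeholders and are not necessarily the notation used in the activity.

```latex
% Sketch: any m > n vectors u_1, ..., u_m in a space V with basis {v_1, ..., v_n}
% are linearly dependent.
\begin{align*}
  \mathbf{u}_j &= a_{1j}\mathbf{v}_1 + a_{2j}\mathbf{v}_2 + \cdots + a_{nj}\mathbf{v}_n
      \qquad \text{for } j = 1, 2, \ldots, m,\\
  x_1\mathbf{u}_1 + x_2\mathbf{u}_2 + \cdots + x_m\mathbf{u}_m
    &= \sum_{i=1}^{n}\left(\sum_{j=1}^{m} a_{ij}x_j\right)\mathbf{v}_i = \mathbf{0}.
\end{align*}
% Independence of v_1, ..., v_n forces Ax = 0, where A = [a_ij] is n x m.
% A homogeneous system with more unknowns (m) than equations (n) has infinitely
% many solutions, so a nontrivial choice of x_1, ..., x_m exists, and the u_j
% are linearly dependent.
```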
Theorem 33.1 shows that if two finite sets are bases for a vector space $V$, then, since each basis is linearly independent by definition, neither set can contain more elements than the other, so the number of elements in each basis must be the same.
Theorem 33.2 states that if a vector space has a basis with a finite number of vectors, then the number of vectors in a basis for that vector space is a well-defined number. In other words, the number of vectors in a basis is an invariant of the vector space. This important number is given a name.
A finite-dimensional vector space is a vector space that can be spanned by a finite number of vectors. The dimension of a non-trivial finite-dimensional vector space $V$ is the number of vectors in a basis for $V$. The dimension of the trivial vector space is defined to be 0.
Not every vector space is finite dimensional. We have seen, for example, that the vector space $P$ of all polynomials, regardless of degree, is not a finite-dimensional vector space. In fact, the polynomials
$1, t, t^2, t^3, \ldots$
are linearly independent, so $P$ contains an infinite linearly independent set and therefore has no finite basis. A vector space that does not have a finite basis is called an infinite dimensional vector space.
Since the columns of the $n \times n$ identity matrix $I_n$ span $\mathbb{R}^n$ and are linearly independent, the columns of $I_n$ form a basis for $\mathbb{R}^n$ (the standard basis). Consequently, we have that $\dim\left(\mathbb{R}^n\right) = n$. In this activity we determine the dimensions of other familiar vector spaces. Find the dimensions of each of the indicated vector spaces. Verify your answers.
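As a quick computational check (a sketch using NumPy; the choice $n = 4$ is arbitrary), we can confirm that the columns of the identity matrix are linearly independent and span $\mathbb{R}^n$ by verifying that the matrix has full rank:

```python
import numpy as np

n = 4  # arbitrary choice of dimension for illustration
I = np.eye(n)

# The columns of I span R^n and are linearly independent exactly when rank(I) = n.
print(np.linalg.matrix_rank(I) == n)  # True
```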
Every subspace of a finite-dimensional vector space is a vector space, and since a subspace is contained in a vector space it is natural to think that the dimension of a subspace should be less than or equal to the dimension of the larger vector space. We verify that fact in this section.
There are two items we need to confirm before we can state that a subset $\mathcal{B}$ of a subspace $W$ of a vector space is a basis for $W$: the set $\mathcal{B}$ must be linearly independent and must span $W$. We can reduce the amount of work it takes to show that a set is a basis if we know the dimension of the vector space in advance.
Let $W$ be the span of the given set of functions. To find a basis for $W$, we find a linearly independent subset of the spanning set. Consider the equation
.
To find the weights for which this equality of functions holds, we use the fact that we must have equality for every input. So we pick four different input values to obtain a linear system that we can solve for the weights. Evaluating both sides of the equation at these four values yields the equations
The reduced row echelon form of the coefficient matrix
is . The general solution to this linear system is and . Notice that
or
,
so is a linear combination of the other vectors. The vectors corresponding to the pivot columns are linearly independent, so it follows that ,, and are linearly independent. We conclude that is a basis for and .
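A computer algebra system can automate this kind of computation. The sketch below uses SymPy with a hypothetical set of polynomials (not the functions in the example above): it evaluates each function at several inputs, row reduces the resulting matrix, and reads off the pivot columns, which identify a linearly independent subset of the spanning set.

```python
import sympy as sp

t = sp.symbols('t')

# Hypothetical spanning set for illustration; 2*t - 3 is redundant since
# 2*t - 3 = 2*(t) - 3*(1).
funcs = [sp.Integer(1), t, 2*t - 3, t**2]

# Evaluate each function at as many sample points as there are functions.
# For polynomials of degree less than the number of sample points this
# faithfully captures all linear dependencies.
samples = [0, 1, 2, 3]
M = sp.Matrix([[f.subs(t, s) for f in funcs] for s in samples])

_, pivot_cols = M.rref()
print("Pivot columns:", pivot_cols)                      # (0, 1, 3)
print("Independent subset:", [funcs[i] for i in pivot_cols])
```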
in (The polynomials $p(t)$ with the property that $p(-t) = p(t)$ are called even polynomials.)
Solution.
Let in . Let and suppose that . Since , we must have and . Since , it follows that
or
.
Similarly, the fact that yields the equation
or
.
The reduced row echelon form of the coefficient matrix of system and is , so it follows that . Thus, and so Span. Equating like terms in the equation
yields . We conclude that is linearly independent and is therefore a basis for . Thus, . As an alternative solution, notice that is not in . So and we know that . Since ,, and are in , we can show as above that ,, and are linearly independent. We can conclude that since it cannot be .
Let $U_{n \times n}$ be the set of all $n \times n$ upper triangular matrices. Recall that a matrix $A = [a_{ij}]$ is upper triangular if $a_{ij} = 0$ whenever $i > j$. That is, a matrix is upper triangular if all entries below the diagonal are 0.
Since the zero matrix has all entries equal to 0, it follows that the zero matrix is in $U_{n \times n}$. Let $A = [a_{ij}]$ and $B = [b_{ij}]$ be in $U_{n \times n}$, and let $C = A + B = [c_{ij}]$. Then $c_{ij} = a_{ij} + b_{ij} = 0$ when $i > j$. So $C$ is an upper triangular matrix and $U_{n \times n}$ is closed under addition. Let $k$ be a scalar. The $ij$th entry of $kA$ is $k a_{ij} = 0$ whenever $i > j$. So $kA$ is an upper triangular matrix and $U_{n \times n}$ is closed under multiplication by scalars. We conclude that $U_{n \times n}$ is a subspace of the space of all $n \times n$ matrices.
Find the dimensions of $U_{2 \times 2}$ and $U_{3 \times 3}$. Explain. Make a conjecture as to what $\dim\left(U_{n \times n}\right)$ is in terms of $n$.
Solution.
Let $M_1$, $M_2$, and $M_3$ be the $2 \times 2$ matrices with a 1 in the $(1,1)$, $(1,2)$, and $(2,2)$ positions, respectively, and 0 in every other position. We will show that $\{M_1, M_2, M_3\}$ is a basis for $U_{2 \times 2}$. Consider the equation
$$c_1 M_1 + c_2 M_2 + c_3 M_3 = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}.$$
Equating like entries shows that $c_1 = c_2 = c_3 = 0$, and so $\{M_1, M_2, M_3\}$ is linearly independent. If $A = \begin{bmatrix} a & b \\ 0 & c \end{bmatrix}$ is in $U_{2 \times 2}$, then
$$A = a M_1 + b M_2 + c M_3,$$
and so $\{M_1, M_2, M_3\}$ spans $U_{2 \times 2}$. Thus, $\{M_1, M_2, M_3\}$ is a basis for $U_{2 \times 2}$ and so $\dim\left(U_{2 \times 2}\right) = 3$. Similarly, for the $3 \times 3$ case let $M_{ij}$ for $i \leq j$ be the matrix with a 1 in the $(i,j)$ position and 0 in every other position. Suppose
$$c_{11} M_{11} + c_{12} M_{12} + c_{13} M_{13} + c_{22} M_{22} + c_{23} M_{23} + c_{33} M_{33} = 0.$$
Equating corresponding entries shows that every $c_{ij} = 0$. So $\{M_{ij} : i \leq j\}$ is a linearly independent set. If $A = [a_{ij}]$ is in $U_{3 \times 3}$, then $A = \sum_{i \leq j} a_{ij} M_{ij}$ and so $\{M_{ij} : i \leq j\}$ spans $U_{3 \times 3}$. We conclude that $\{M_{ij} : i \leq j\}$ is a basis for $U_{3 \times 3}$ and $\dim\left(U_{3 \times 3}\right) = 6$. In general, for an $n \times n$ matrix, the set of matrices $M_{ij}$ with $i \leq j$, one for each entry on and above the diagonal, is a basis for $U_{n \times n}$. There are $n$ such matrices for the entries on the diagonal. The number of entries above the diagonal is equal to half the total number of entries ($\frac{n^2}{2}$) minus half the number of entries on the diagonal ($\frac{n}{2}$). So there is a total of $\frac{n^2 - n}{2} = \frac{n(n-1)}{2}$ such matrices for the entries above the diagonal. Therefore, $\dim\left(U_{n \times n}\right) = n + \frac{n(n-1)}{2} = \frac{n(n+1)}{2}$.
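The counting argument can be checked with a short computation. The sketch below (using NumPy and an arbitrary choice of $n$) builds one basis matrix for each position on or above the diagonal and compares the count with the formula $\frac{n(n+1)}{2}$.

```python
import numpy as np

n = 4  # arbitrary size for illustration

# One basis matrix M_ij for each position (i, j) with i <= j.
basis = []
for i in range(n):
    for j in range(i, n):
        M = np.zeros((n, n))
        M[i, j] = 1
        basis.append(M)

print(len(basis))           # 10
print(n * (n + 1) // 2)     # 10, agreeing with the formula n(n+1)/2
```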
A finite dimensional vector space is a vector space that can be spanned by a finite set of vectors.
We showed that any two bases for a finite dimensional vector space $V$ must contain the same number of vectors. Therefore, we can define the dimension of a finite dimensional vector space $V$ to be the number of vectors in any basis for $V$.
If $V$ is a vector space with dimension $n$ and $S$ is any linearly independent subset of $V$ with $n$ vectors, then $S$ is a basis for $V$. Otherwise, we could add vectors to $S$ to make a basis for $V$, and then $V$ would have a basis of more than $n$ vectors.
If $V$ is a vector space with dimension $n$ and $S$ is any subset of $V$ with $n$ vectors that spans $V$, then $S$ is a basis for $V$. Otherwise, we could remove vectors from $S$ to obtain a basis for $V$, and then $V$ would have a basis of fewer than $n$ vectors.
For any finite dimensional vector space $V$ and any subspace $W$ of $V$, we have $\dim(W) \leq \dim(V)$.
Explain why and can be written as linear combinations of the vectors in . Use these linear combinations to find two vectors that are in . Then show that these vectors span a plane in .
Suppose we were to film an experiment involving a ball that is bouncing up and down. Naively, we set up several cameras to follow the process of the experiment from different perspectives and collect the data. All of this data tells us something about the bouncing ball, but there may be no perspective that tells us an important piece of information: the axis along which the ball bounces. The question, then, is how we can extract from the data this important piece of information. Principal Component Analysis (PCA) is a tool for just this type of analysis.
We will use an example to illustrate important concepts we will need. To realistically apply PCA we will have much more data than this, but for now we will restrict ourselves to only two variables so that we can visualize our results. Table 33.7 presents information from ten states on two attributes related to the SAT: the Evidence-Based Reading and Writing (EBRW) score and the Math score. The SAT is made up of three sections: Reading, Writing and Language (also just called Writing), and Math. The EBRW score is calculated by combining the Reading and Writing section scores; both the Math and EBRW scores are reported on a scale of 200-800.
Each attribute (EBRW score, Math score) creates a vector whose entries are the student responses for that attribute. The data provides the average scores from participating students in each state. In this example we have two attribute vectors, $\mathbf{x}_1$ (the EBRW scores) and $\mathbf{x}_2$ (the Math scores),
that make up our data set. A plot of the data is shown at left in Figure 33.8, where the EBRW score is along the horizontal axis and the Math score is along the vertical axis.
Before analyzing a data set there is often some initial preparation that needs to be done. Issues that have to be dealt with include the problem that attributes might have different units and scales. For example, in a data set with attributes about people, height could be measured in inches while weight is measured in pounds. It is difficult to compare variables when they are on different scales. Another issue to consider is that some attributes are independent of the others (height, for example, does not depend on hair color), while some are interrelated (body mass index depends on height and weight). To simplify our work, we will not address these types of problems. The only preparation we will do with our data is to center it.
We can use these sample means to center our data at the origin by translating the data so that each row of our data matrix has mean 0. We do this by subtracting the mean for that row vector from each component of the vector. Determine the matrix $X$ that contains the centered data for our SAT data set, starting from the matrix of raw data.
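A sketch of the centering step in NumPy, using a small made-up data matrix (not the SAT data from Table 33.7) whose rows are the attributes:

```python
import numpy as np

# Hypothetical raw data: each row is one attribute, each column one observation.
raw = np.array([[630.0, 540.0, 570.0, 600.0],
                [610.0, 550.0, 560.0, 590.0]])

# Subtract each row's mean from that row so every row of X has mean 0.
X = raw - raw.mean(axis=1, keepdims=True)

print(X.mean(axis=1))  # [0. 0.] (up to rounding)
```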
A plot of the centered data for our SAT data is shown at right in Figure 33.8. Later we will see why centering the data is useful: it will more easily allow us to project onto subspaces. The goal of PCA is to find a matrix $P$ so that $Y = PX$, where $X$ is the centered data matrix and $Y$ is suitably transformed to identify the important aspects of the data. We will discuss what the important aspects are shortly. Before we do so, we need to discuss a way to compare the one-dimensional data vectors $\mathbf{x}_1$ and $\mathbf{x}_2$.
It is often useful to know how spread out a data set is, something the average doesn't tell us. For example, two data sets can have the same average while one is much more spread out than the other. Variance provides one measure of how spread out a one-dimensional data set is. The variance of a data vector $\mathbf{x} = [x_1 \ x_2 \ \cdots \ x_n]^{\mathsf{T}}$ with mean $\bar{x}$ is defined as
$$\operatorname{var}(\mathbf{x}) = \frac{1}{n-1} \sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2.$$
The variance provides a measure of how far from the average the data is spread. Determine the variances of the two data vectors $\mathbf{x}_1$ and $\mathbf{x}_2$. Which is more spread out?
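A sketch of the variance computation in NumPy; note the ddof=1 argument, which makes NumPy divide by $n-1$ as in the definition above. The data vector is made up for illustration.

```python
import numpy as np

x = np.array([4.0, 7.0, 13.0, 16.0])  # hypothetical one-dimensional data

# Variance computed directly from the definition, dividing by n - 1.
var_manual = np.sum((x - x.mean())**2) / (len(x) - 1)

# NumPy's built-in variance with ddof=1 uses the same n - 1 denominator.
print(var_manual, np.var(x, ddof=1))  # both 30.0
```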
In general, we will have more than one-dimensional data, as in our SAT data set. It will be helpful to have a way to compare one-dimensional data sets to try to capture the idea of variance for different data sets: how much the data in two different data sets varies from the mean with respect to each other. One such measure is covariance, essentially the average of all corresponding products of deviations from the means. We define the covariance of two data vectors $\mathbf{x}$ and $\mathbf{y}$ in $\mathbb{R}^n$ as
$$\operatorname{cov}(\mathbf{x}, \mathbf{y}) = \frac{1}{n-1} \sum_{i=1}^{n} \left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right).$$
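A sketch of the covariance formula in NumPy with two made-up data vectors; np.cov (which also divides by $n-1$) is used as a cross-check.

```python
import numpy as np

x = np.array([4.0, 7.0, 13.0, 16.0])   # hypothetical data
y = np.array([2.0, 5.0, 11.0, 14.0])   # hypothetical data

# Covariance from the definition: products of deviations, divided by n - 1.
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# np.cov returns the full 2x2 covariance matrix; the (0, 1) entry is cov(x, y).
print(cov_manual, np.cov(x, y)[0, 1])  # both 30.0
```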
Determine all of the covariances
$$\operatorname{cov}(\mathbf{x}_1, \mathbf{x}_1), \quad \operatorname{cov}(\mathbf{x}_1, \mathbf{x}_2), \quad \operatorname{cov}(\mathbf{x}_2, \mathbf{x}_1), \quad \text{and} \quad \operatorname{cov}(\mathbf{x}_2, \mathbf{x}_2).$$
How are $\operatorname{cov}(\mathbf{x}_1, \mathbf{x}_2)$ and $\operatorname{cov}(\mathbf{x}_2, \mathbf{x}_1)$ related? How are $\operatorname{cov}(\mathbf{x}_1, \mathbf{x}_1)$ and $\operatorname{cov}(\mathbf{x}_2, \mathbf{x}_2)$ related to variances?
What is most important about covariance is its sign. Suppose $\operatorname{cov}(\mathbf{x}, \mathbf{y}) > 0$. Then if an entry $x_i$ is larger than $\bar{x}$, it is likely that the corresponding entry $y_i$ is also larger than $\bar{y}$. For example, if $\mathbf{x}$ is a vector that records a person's height at a sequence of ages and $\mathbf{y}$ records that same person's weight at the same ages, we might expect that when $\mathbf{x}$ increases so does $\mathbf{y}$. Similarly, if $\operatorname{cov}(\mathbf{x}, \mathbf{y}) < 0$, then as one data set increases, the other decreases. As an example, if $\mathbf{x}$ records the number of hours a student spends playing video games each semester and $\mathbf{y}$ gives the student's GPA for each semester, then we might expect that $\mathbf{y}$ decreases as $\mathbf{x}$ increases. When $\operatorname{cov}(\mathbf{x}, \mathbf{y}) = 0$, then $\mathbf{x}$ and $\mathbf{y}$ are said to be uncorrelated or independent of each other. For our example vectors $\mathbf{x}_1$ and $\mathbf{x}_2$, what does $\operatorname{cov}(\mathbf{x}_1, \mathbf{x}_2)$ tell us about the relationship between $\mathbf{x}_1$ and $\mathbf{x}_2$? Why should we expect this from the context?
The covariance gives us information about the relationships between the attributes. So instead of working with the original data, we work with the covariance data. If we have data vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m$ in $\mathbb{R}^n$, the covariance matrix $C$ satisfies $C_{ij} = \operatorname{cov}(\mathbf{x}_i, \mathbf{x}_j)$. Calculate the covariance matrix $C$ for our SAT data. Then explain why $C = \frac{1}{n-1} X X^{\mathsf{T}}$, where $X$ is the centered data matrix whose rows are the centered attribute vectors.
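A sketch of the covariance matrix computation in NumPy, using the hypothetical centered matrix from the earlier sketch; the line C = X @ X.T / (n - 1) is the matrix form just described, and np.cov provides a cross-check.

```python
import numpy as np

# Hypothetical centered data: rows are attributes, columns are observations.
X = np.array([[ 45.0, -45.0, -15.0, 15.0],
              [ 32.5, -27.5, -17.5, 12.5]])
n = X.shape[1]  # number of observations

C = X @ X.T / (n - 1)             # covariance matrix: C[i, j] = cov(row i, row j)
print(C)
print(np.allclose(C, np.cov(X)))  # True: np.cov of the rows agrees
```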
Recall that the goal of PCA is to find a matrix $P$ such that $Y = PX$, where $Y$ represents the data set transformed to a coordinate system in which the important aspects of the data are revealed. We are now in a position to discuss what that means.
An ideal view of our data would be one in which we can see the direction of greatest variance and one that minimizes redundancy. With redundant data the variables are not independent; that is, their covariance is nonzero. So we would like the covariances to all be zero (or as close to zero as possible) to remove redundancy in our data. That means that we would like a covariance matrix in which the non-diagonal entries are all zero. This will be possible if the covariance matrix is diagonalizable.
Consider the covariance matrix $C$ for our SAT data. Explain why we can find a matrix with determinant 1 whose columns are unit vectors and that diagonalizes $C$. Then find such a matrix. Use technology as appropriate.
For our purposes, we want to diagonalize $C$ with a product of the form $PCP^{\mathsf{T}}$, so the matrix $P$ that serves our purposes is the one whose rows are the unit eigenvectors of $C$. To understand why this matrix is the one we want, recall that we want to have $Y = PX$, and we want $Y$ to have a diagonal covariance matrix $C_Y$. In this situation we will have (recalling that $P^{\mathsf{T}} = P^{-1}$)
$$C_Y = \frac{1}{n-1} Y Y^{\mathsf{T}} = \frac{1}{n-1}(PX)(PX)^{\mathsf{T}} = P\left(\frac{1}{n-1} X X^{\mathsf{T}}\right) P^{\mathsf{T}} = P C P^{\mathsf{T}},$$
which is diagonal when the rows of $P$ are unit eigenvectors of $C$.
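A sketch of this diagonalization in NumPy, using the hypothetical centered matrix from the earlier sketches: np.linalg.eigh returns the eigenvalues of the symmetric matrix $C$ in increasing order together with orthonormal eigenvectors as columns, so we reverse the order and transpose to get a matrix $P$ whose rows are unit eigenvectors.

```python
import numpy as np

X = np.array([[ 45.0, -45.0, -15.0, 15.0],
              [ 32.5, -27.5, -17.5, 12.5]])
n = X.shape[1]
C = X @ X.T / (n - 1)

# eigh is for symmetric matrices; eigenvalues come back in increasing order.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]      # largest eigenvalue first
P = eigvecs[:, order].T                # rows of P are unit eigenvectors of C

Y = P @ X                              # transformed data
C_Y = P @ C @ P.T                      # equals (1/(n-1)) Y Y^T
print(np.round(C_Y, 10))               # diagonal, with the eigenvalues on the diagonal
```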
There are two useful ways we can interpret the results of our work so far. A unit eigenvector $\mathbf{u}_1$ of $C$ corresponding to the largest (also called the dominant) eigenvalue is called the first principal component of the data. A plot of the centered data along with the eigenspace of $C$ spanned by $\mathbf{u}_1$ is shown at left in Figure 33.9. Notice that this line indicates the direction of greatest variation in the data. That is, the sum of the squares of the differences between the data points and the mean is as large as possible in this direction. In other words, when we project the data points onto this eigenspace, as shown at right in Figure 33.9, the variation of the resulting points is larger than it is for any other line; the data is most spread out in this direction.
There is another way we can interpret this result. If we drop a perpendicular from one of our data points to the eigenspace spanned by $\mathbf{u}_1$, we create a right triangle whose sides are the segment from the origin to the data point, the projection of the data point onto the eigenspace, and the perpendicular segment itself, as illustrated in the middle of Figure 33.9. Use this idea to explain why maximizing the variation also minimizes the sum of the squares of the distances from the data points to this line. As a result, we have projected our two-dimensional data onto the one-dimensional space that maximizes the variance of the data.
Recall that the matrix $P$ whose rows are the unit eigenvectors of $C$ transforms the data set $X$ to a new data set $Y = PX$ whose covariance matrix is diagonal. Explain how the horizontal axis is related to the transformed data set $Y$.
The result of Project Activity 33.9 is that we have reduced the problem from considering the data in a two-dimensional space to a one-dimensional space where the most important aspect of the data is revealed. Of course, we eliminate some of the characteristics of the data, but the most important aspect is still included and highlighted. This is one of the important uses of PCA, data dimension reduction, which allows us to reveal key relationships between the variables that might not be evident when looking at a large dataset.
The second eigenvector of $C$ also has meaning. A picture of the eigenspace corresponding to the smaller eigenvalue of $C$ is shown in Figure 33.11. The second eigenvector of $C$ is orthogonal to the first, and its direction tells us the direction of the second largest amount of variance, as can be seen in Figure 33.11.
To summarize, the unit eigenvector for the largest eigenvalue of $C$ indicates the direction in which the data has the greatest variance. The direction of the unit eigenvector for the smaller eigenvalue shows the direction in which the data has the second largest variance. This direction is also perpendicular to the first (indicating zero covariance). The directions of the eigenvectors are called the principal components of the data. The eigenvector with the largest eigenvalue is the first principal component of the data set, and the other eigenvectors are ordered by their eigenvalues, from largest to smallest. The principal components provide a new coordinate system from which to view our data, one in which we can see the maximum variance and in which there is zero covariance.
We can use the eigenvalues of $C$ to quantify the amount of variance that is accounted for by our projections. Notice that the points along the horizontal axis at right in Figure 33.10 are exactly the numbers in the first row of $Y$. These numbers provide the projections of the data in $X$ onto the axis along which the data has its greatest variance.
Calculate the variance of the data given by the first row of $Y$. This is the variance of the data in the direction of the eigenspace spanned by $\mathbf{u}_1$. How does the result compare to the entries of the covariance matrix for the transformed data $Y$?
The total variance of the data set is the sum of the variances. Explain why the amount of variance in the data that is accounted for in the direction of $\mathbf{u}_1$ is
$$\frac{\lambda_1}{\lambda_1 + \lambda_2},$$
where $\lambda_1 \geq \lambda_2$ are the eigenvalues of $C$.
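Continuing the NumPy sketch with the hypothetical data above, the fraction of the total variance captured by the first principal component is the largest eigenvalue divided by the sum of the eigenvalues:

```python
import numpy as np

X = np.array([[ 45.0, -45.0, -15.0, 15.0],
              [ 32.5, -27.5, -17.5, 12.5]])
C = X @ X.T / (X.shape[1] - 1)

eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]   # eigenvalues, largest first
print(eigvals[0] / eigvals.sum())                # proportion of variance explained
```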
In general, PCA is most useful for larger data sets. The process is the same.
Start with a set of data that forms the rows of an $m \times n$ matrix, one row for each attribute. We center the data by subtracting the mean of each row from the entries of that row to create a centered data set in a matrix $X$.
The principal components of the data are the unit eigenvectors of the covariance matrix $C = \frac{1}{n-1} X X^{\mathsf{T}}$, ordered so that they correspond to the eigenvalues of $C$ in decreasing order.
Let $P$ be the matrix whose rows are the principal components of the data, ordered from largest eigenvalue to smallest. Then $Y = PX$ is suitably transformed to identify the important aspects of the data.
If $\lambda_1, \lambda_2, \ldots, \lambda_m$ are the eigenvalues of $C$ in decreasing order, then the amount of variance in the data accounted for by the first $k$ principal components is given by
$$\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_m}.$$
The first $k$ rows of $Y$ provide the projection of the data set onto a $k$-dimensional space spanned by the first $k$ principal components of the data.
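The steps above translate directly into a short program. The sketch below is a minimal NumPy implementation under the conventions used here (rows of the raw matrix are attributes); it is illustrative only and is not tied to the SAT data.

```python
import numpy as np

def pca(raw, k=None):
    """Rows of `raw` are attributes, columns are observations.
    Returns the principal components (rows of P), the transformed data Y = P X,
    and the fraction of variance explained by each component."""
    X = raw - raw.mean(axis=1, keepdims=True)           # center each row
    n = X.shape[1]
    C = X @ X.T / (n - 1)                                # covariance matrix

    eigvals, eigvecs = np.linalg.eigh(C)                 # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]                    # largest eigenvalue first
    eigvals, P = eigvals[order], eigvecs[:, order].T     # rows of P = principal components

    Y = P @ X
    explained = eigvals / eigvals.sum()
    if k is not None:
        return P[:k], Y[:k], explained[:k]
    return P, Y, explained

# Hypothetical 3-attribute data set for illustration.
raw = np.array([[630.0, 540.0, 570.0, 600.0],
                [610.0, 550.0, 560.0, 590.0],
                [ 55.0,  80.0,  70.0,  60.0]])
P, Y, explained = pca(raw)
print(np.round(explained, 3))   # variance explained by each principal component
```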
Let us now consider a problem with more than two variables. We continue to keep the data set small so that we can conveniently operate with it. Table 33.12 presents additional information from ten states on four attributes related to the SAT: participation rates, Evidence-Based Reading and Writing (EBRW) score, Math score, and average SAT score. Use technology as appropriate for this activity.
How much variation is accounted for in the data by the first principal component? In other words, if we reduce this data to one dimension, how much of the variation do we retain? Explain.
How much variation is accounted for in the data by the first two principal components? In other words, if we reduce this data to two dimensions, how much of the variation do we retain? Explain.
We conclude with a comment. A reasonable question to ask is how we interpret the principal components. Let $P$ be an orthogonal matrix such that $PCP^{\mathsf{T}}$ is the diagonal matrix with the eigenvalues of $C$ along the diagonal, in decreasing order. We then have the new perspective $Y = PX$ from which to view the data. The first principal component $\mathbf{p}_1 = [a_1 \ a_2 \ \cdots \ a_m]$ (the first row of $P$) determines the new variable $\mathbf{y}_1$ (the first row of $Y$) in the following manner. Let $X_1$, $X_2$, $\ldots$, $X_n$ represent the columns of $X$ so that $X = [X_1 \ X_2 \ \cdots \ X_n]$. Recognizing that each column of $X$ contains the values of the original variables for one observation, we have
$$\mathbf{y}_1 = \mathbf{p}_1 X = [\mathbf{p}_1 X_1 \ \ \mathbf{p}_1 X_2 \ \ \cdots \ \ \mathbf{p}_1 X_n],$$
so the $j$th entry of $\mathbf{y}_1$ is $\mathbf{p}_1 X_j = a_1 x_{1j} + a_2 x_{2j} + \cdots + a_m x_{mj}$.
So each entry of $\mathbf{y}_1$ is a linear combination of the original variables (contained in the columns of $X$) with weights from the first principal component. The other new variables are obtained in the same way from the remaining principal components. So even though the principal components may not have an easy interpretation in context, they are connected to the original data in this way. By reducing the data to a few important principal components, that is, by visualizing the data in a subspace of small dimension, we can account for almost all of the variation in the data and relate that information back to the original data.
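As a quick numerical check of this interpretation (continuing the hypothetical NumPy sketches from above), each entry of the first row of $Y = PX$ is the first row of $P$ dotted with the corresponding column of $X$:

```python
import numpy as np

X = np.array([[ 45.0, -45.0, -15.0, 15.0],
              [ 32.5, -27.5, -17.5, 12.5]])
C = X @ X.T / (X.shape[1] - 1)
_, eigvecs = np.linalg.eigh(C)
P = eigvecs[:, ::-1].T             # rows are unit eigenvectors, largest eigenvalue first

y1 = (P @ X)[0]                    # first new variable (first row of Y)

# Each entry of y1 is the first principal component dotted with a column of X,
# i.e., a weighted combination of the original variables for that observation.
check = [np.isclose(y1[j], P[0] @ X[:, j]) for j in range(X.shape[1])]
print(all(check))                  # True
```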
It might seem that we should divide by $n$ instead of $n-1$ in the variance, but it is generally accepted to do this for reasons we won't get into. Suffice it to say that if we are using a sample of the entire population, then dividing by $n-1$ provides a variance whose square root is closer to the standard deviation of the population than we would get if we divided by $n$. If we are calculating the variance of an entire population, then we would divide by $n$.