
Section 33 The Dimension of a Vector Space

Subsection Application: Principal Component Analysis

The discipline of statistics is based on the idea of analyzing data. In large data sets one usually wants to understand the relationships among the different variables in the set. This can be difficult to do when the data set is large and it is impossible to visually examine the data for patterns. Principal Component Analysis (PCA) is a tool for identifying and representing underlying patterns in large data sets, and PCA has been called one of the most important and valuable linear algebra tools for statistical analysis. PCA is used to transform a collection of variables into a (usually smaller) number of uncorrelated variables called principal components. The principal components form the most meaningful basis from which to view data by removing extraneous information and revealing underlying relationships in the data. PCA thus provides a framework for reducing a complex data set to a lower dimension while retaining the important attributes of the data set. The output helps the experimenter determine which dynamics in the data are important and which can be ignored.

Subsection Introduction

In Section 15 we learned that any two bases for a subspace of Rn contain the same number of vectors. This allowed us to define the dimension of a subspace of Rn. In this section we extend the arguments we made in Section 15 to arbitrary vector spaces and define the dimension of a vector space.

Preview Activity 33.1.

The main tool we used to prove that any two bases for a subspace of Rn must contain the same number of elements was Theorem 15.1. In this preview activity we will show that the same argument can be used for vector spaces. More specifically, we will prove a special case of Theorem 33.1, which generalizes Theorem 15.1: if a vector space V has a basis containing n vectors, then any subset of V containing more than n vectors is linearly dependent.

Suppose V is a vector space with basis B={v1,v2}. Consider the set U={u1,u2,u3} of vectors in V. We will show that U is linearly dependent using an approach similar to that of Preview Activity 15.1.

(a)

What vector equation involving u1,u2,u3 do we need to solve to determine linear independence/dependence of these vectors? Use x1,x2,x3 for coefficients.

(b)

Since B is a basis of V, it spans V. Using this information, rewrite the vectors ui in terms of vj and substitute into the above equation to obtain another equation in terms of vj.

(c)

Since B is a basis of V, the vectors v1,v2 are linearly independent. Using the equation in the previous part, determine what this means about the coefficients x1,x2,x3.

(d)

Express the conditions on x1,x2,x3 in the form of a matrix-vector equation. Explain why there are infinitely many solutions for xi's and why this means the vectors u1,u2,u3 are linearly dependent.
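A minimal sketch (in LaTeX) of how the pieces of this preview activity fit together is given below; the coefficients a11, ..., a23 are hypothetical placeholders for whatever expansions of u1, u2, u3 in terms of v1 and v2 the basis B provides.

    % Writing u_j = a_{1j} v_1 + a_{2j} v_2 and substituting into
    % x_1 u_1 + x_2 u_2 + x_3 u_3 = 0, then collecting the v_1 and v_2 terms
    % and using the linear independence of v_1 and v_2, gives the homogeneous system
    \[
      \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix}
      \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}
      =
      \begin{bmatrix} 0 \\ 0 \end{bmatrix}.
    \]
    % A homogeneous system with more unknowns than equations has infinitely many
    % solutions, so some nontrivial choice of x_1, x_2, x_3 works, and
    % u_1, u_2, u_3 are linearly dependent.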

Subsection Finite Dimensional Vector Spaces

Theorem 33.1 shows that if B1 and B2 are finite bases for a vector space V, then, since each basis is by definition linearly independent, neither basis can contain more elements than the other. Consequently, the number of elements in the two bases must be equal.

Theorem 33.2 states that if a vector space V has a basis with a finite number of vectors, then the number of vectors in a basis for that vector space is a well-defined number. In other words, the number of vectors in a basis is an invariant of the vector space. This important number is given a name.

Definition 33.3.

A finite-dimensional vector space is a vector space that can be spanned by a finite number of vectors. The dimension of a non-trivial finite-dimensional vector space V is the number of vectors in a basis for V. The dimension of the trivial vector space is defined to be 0.

We denote the dimension of a finite dimensional vector space V by dim(V).

Not every vector space is finite dimensional. We have seen, for example, that the vector space P of all polynomials, regardless of degree, is not a finite-dimensional vector space. In fact, the polynomials

1, t, t2, …, tn, …

are linearly independent, so P has an infinite linearly independent set and therefore has no finite basis. A vector space that has an infinite basis is called an infinite dimensional vector space.

Activity 33.2.

Since the columns of the n×n identity matrix span Rn and are linearly independent, the columns of In form a basis for Rn (the standard basis). Consequently, we have that dim(Rn) = n. In this activity we determine the dimensions of other familiar vector spaces. Find the dimensions of each of the indicated vector spaces. Verify your answers.

Finding the dimension of a finite-dimensional vector space amounts to finding a basis for the space.
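When a spanning set is given explicitly, one way to check a dimension count is to translate the vectors into coordinates and compute a rank. The sketch below is an illustration only (the polynomials p1, p2, p3 are made up for this example, not taken from the activities): it represents polynomials in P2 by their coordinate vectors with respect to {1, t, t2} and uses numpy's matrix rank to count how many of them are linearly independent.

    import numpy as np

    # Hypothetical polynomials in P2, written as coordinate vectors with
    # respect to the standard basis {1, t, t^2}:
    #   p1 = 1 + 2t,  p2 = 3t + t^2,  p3 = 1 + 5t + t^2  (note p3 = p1 + p2)
    coords = np.array([
        [1, 2, 0],   # p1
        [0, 3, 1],   # p2
        [1, 5, 1],   # p3
    ])

    # The dimension of Span{p1, p2, p3} is the number of linearly independent
    # coordinate vectors, which is the rank of this matrix.
    print(np.linalg.matrix_rank(coords))   # prints 2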

Activity 33.3.

Let W = {(a+b) + (a−b)t + (2a+3b)t2 | a, b are scalars}.

(a)

Find a finite set of polynomials in W that span W.

(b)

Determine if the spanning set from part (a) is linearly independent or dependent. Clearly explain your process.

(c)

What is dim(W)? Explain.

Subsection The Dimension of a Subspace

Every subspace of a finite-dimensional vector space is a vector space, and since a subspace is contained in a vector space it is natural to think that the dimension of a subspace should be less than or equal to the dimension of the larger vector space. We verify that fact in this section.

Activity 33.4.

Let V be a finite dimensional vector space of dimension n and let W be a subspace of V. Explain why W cannot have dimension larger than dim(V), and why dim(W) < dim(V) if W ≠ V.

Subsection Conditions for a Basis of a Vector Space

There are two items we need to confirm before we can state that a subset B of a subspace W of a vector space is a basis for W: the set B must be linearly independent and span W. We can reduce the amount of work it takes to show that a set is a basis if we know the dimension of the vector space in advance.

Activity 33.5.

Let W be a subspace of a vector space V with dim(W) = k. We know that every basis of W contains exactly k vectors.

(a)

Suppose that S is a subset of W that contains k vectors and is linearly independent. In this part of the activity we will show that S must span W.

(i)

Suppose that S does not span W. Explain why this implies that W contains a set of k+1 linearly independent vectors.

(ii)

Explain why the result of part (i) tells us that S is a basis for W.

(b)

Now suppose that S is a subset of W with k vectors that spans W. In this part of the activity we will show that S must be linearly independent.

(i)

Suppose that S is not linearly independent. Explain why we can then find a proper subset of S that is linearly independent but has the same span as S.

(ii)

Explain why the result of part (i) tells us that S is a basis for W.

The result of Activity 33.5 is summarized in the following theorem (compare to Theorem 15.4): if W is a vector space with dim(W) = k, then any linearly independent subset of W containing k vectors is a basis for W, and any subset of W containing k vectors that spans W is a basis for W.

Subsection Examples

What follows are worked examples that use the concepts from this section.

Example 33.5.

Find a basis and dimension for each of the indicated subspaces of the given vector spaces.

(a)

{a + b(t+t2) : a, b ∈ R} in P2

Solution.

Let W = {a + b(t+t2) : a, b ∈ R}. Every element in W has the form

a+b(t+t2)=a(1)+b(t+t2).

So W=Span{1,t+t2}. Since neither 1 nor t+t2 is a scalar multiple of the other, the set {1,t+t2} is linearly independent. Thus, {1,t+t2} is a basis for W and dim⁑(W)=2.

(b)

Span{1, 1/(1+x2), (2+x2)/(1+x2), arctan(x)} in F

Solution.

Let W = Span{1, 1/(1+x2), (2+x2)/(1+x2), arctan(x)}. To find a basis for W, we find a linearly independent subset of {1, 1/(1+x2), (2+x2)/(1+x2), arctan(x)}. Consider the equation

c1(1) + c2(1/(1+x2)) + c3((2+x2)/(1+x2)) + c4 arctan(x) = 0.

To find the weights ci for which this equality of functions holds, we use the fact that we must have equality for every x. So we pick four different values for x to obtain a linear system that we can solve for the weights. Evaluating both sides of the equation at x=0, x=1, x=−1, and x=2 yields the equations

c1 + c2 + 2c3 = 0
c1 + (1/2)c2 + (3/2)c3 + (π/4)c4 = 0
c1 + (1/2)c2 + (3/2)c3 − (π/4)c4 = 0
c1 + (1/5)c2 + (6/5)c3 + arctan(2)c4 = 0.

The reduced row echelon form of the coefficient matrix

[ 1   1     2     0         ]
[ 1   1/2   3/2   π/4       ]
[ 1   1/2   3/2   −π/4      ]
[ 1   1/5   6/5   arctan(2) ]

is

[ 1  0  1  0 ]
[ 0  1  1  0 ]
[ 0  0  0  1 ]
[ 0  0  0  0 ].

The general solution to this linear system is c1 = c2 = −c3 and c4 = 0. Notice that

(−1)(1) + (−1)(1/(1+x2)) + (2+x2)/(1+x2) = 0

or

(2+x2)/(1+x2) = 1 + 1/(1+x2),

so (2+x2)/(1+x2) is a linear combination of the other vectors. The vectors corresponding to the pivot columns are linearly independent, so it follows that 1, 1/(1+x2), and arctan(x) are linearly independent. We conclude that {1, 1/(1+x2), arctan(x)} is a basis for W and dim(W) = 3.
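For a numerical double check of the row reduction above, one can build the matrix of function values directly with software. The short numpy sketch below is an illustration only (not part of the original solution); it uses the same sample points x = 0, 1, −1, 2 as above.

    import numpy as np

    # Columns correspond to the functions 1, 1/(1+x^2), (2+x^2)/(1+x^2), arctan(x);
    # rows correspond to the sample points x = 0, 1, -1, 2 used in the solution.
    xs = np.array([0.0, 1.0, -1.0, 2.0])
    A = np.column_stack([
        np.ones_like(xs),
        1 / (1 + xs**2),
        (2 + xs**2) / (1 + xs**2),
        np.arctan(xs),
    ])

    # Rank 3 (rather than 4) confirms that the four functions are linearly
    # dependent and that three of them suffice to span W.
    print(np.linalg.matrix_rank(A))   # prints 3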

(c)

{p(t) ∈ P4 : p(−t) = p(t)} in P4 (The polynomials with the property that p(−t) = p(t) are called even polynomials.)

Solution.

Let W = {p(t) ∈ P4 : p(−t) = p(t)}. Let p(t) ∈ W and suppose that p(t) = a0 + a1t + a2t2 + a3t3 + a4t4. Since p(−t) = p(t) for all t, we must have p(1) = p(−1) and p(3) = p(−3). Since p(1) = p(−1), it follows that

a0 + a1 + a2 + a3 + a4 = a0 − a1 + a2 − a3 + a4

or

a1+a3=0.

Similarly, the fact that p(3)=p(βˆ’3) yields the equation

a0 + 3a1 + 9a2 + 27a3 + 81a4 = a0 − 3a1 + 9a2 − 27a3 + 81a4

or

a1+9a3=0.

The reduced row echelon form of the coefficient matrix of the system a1 + a3 = 0 and a1 + 9a3 = 0 is I2, so it follows that a1 = a3 = 0. Thus, p(t) = a0 + a2t2 + a4t4 and so W = Span{1, t2, t4}. Equating like terms in the equation

c1(1)+c2(t2)+c3(t4)=0

yields c1 = c2 = c3 = 0. We conclude that {1, t2, t4} is linearly independent and is therefore a basis for W. Thus, dim(W) = 3. As an alternative solution, notice that p(t) = t is not in W. So W ≠ P4 and we know that dim(W) < dim(P4). Since 1, t2, and t4 are in W, we can show as above that 1, t2, and t4 are linearly independent. We can conclude that dim(W) = 3 since it cannot be 4.

Example 33.6.

Let Un×n be the set of all n×n upper triangular matrices. Recall that a matrix A = [aij] is upper triangular if aij = 0 whenever i > j. That is, a matrix is upper triangular if all entries below the diagonal are 0.

(a)

Show that Un×n is a subspace of Mn×n.

Solution.

Since the n×n zero matrix 0n×n has all entries equal to 0, it follows that 0n×n is in Un×n. Let A = [aij] and B = [bij] be in Un×n, and let C = [cij] = A + B. Then cij = aij + bij = 0 + 0 = 0 when i > j. So C is an upper triangular matrix and Un×n is closed under addition. Let c be a scalar. The ijth entry of cA is caij = c(0) = 0 whenever i > j. So cA is an upper triangular matrix and Un×n is closed under multiplication by scalars. We conclude that Un×n is a subspace of Mn×n.

(b)

Find the dimensions of U2×2 and U3×3. Explain. Make a conjecture as to what dim(Un×n) is in terms of n.

Solution.

Let M11 = [1 0; 0 0], M12 = [0 1; 0 0], and M22 = [0 0; 0 1]. We will show that S = {M11, M12, M22} is a basis for U2×2. Consider the equation

x1M11+x2M12+x3M22=0.

Equating like entries shows that x1 = x2 = x3 = 0, and so S is linearly independent. If [a b; 0 c] is in U2×2, then

[a b; 0 c] = aM11 + bM12 + cM22

and so S spans U2×2. Thus, S is a basis for U2×2 and so dim(U2×2) = 3. Similarly, for the 3×3 case let Mij for i ≤ j be the 3×3 matrix with a 1 in the ij position and 0 in every other position. Let

S={M11,M12,M13,M22,M23,M33}.

Equating corresponding entries shows that if

x1M11+x2M12+x3M13+x4M22+x5M23+x6M33=0,

then x1 = x2 = x3 = x4 = x5 = x6 = 0. So S is a linearly independent set. If A = [aij] is in U3×3, then A is the sum of the matrices aijMij over all positions with i ≤ j, so S spans U3×3. We conclude that S is a basis for U3×3 and dim(U3×3) = 6. In general, for an n×n matrix, the set of matrices Mij, one for each entry on and above the diagonal, is a basis for Un×n. There are n such matrices for the entries on the diagonal. The number of entries above the diagonal is equal to half the total number of entries (n2) minus half the number of entries on the diagonal (n), that is, (n2−n)/2. So there is a total of (n2−n)/2 such matrices for the entries above the diagonal. Therefore,

dim(Un×n) = n + (n2−n)/2 = (n2+n)/2.
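As a quick sanity check on this formula, the short sketch below simply counts the matrix positions on or above the diagonal, one for each basis matrix Mij; it is an illustration, not part of the original example.

    # Each position (i, j) with i <= j contributes one basis matrix M_ij,
    # so counting these positions gives dim(U_{n x n}).
    def dim_upper_triangular(n):
        return sum(1 for i in range(n) for j in range(n) if i <= j)

    for n in range(1, 6):
        assert dim_upper_triangular(n) == (n**2 + n) // 2
        print(n, dim_upper_triangular(n))   # 1 1, 2 3, 3 6, 4 10, 5 15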

Subsection Summary

  • A finite dimensional vector space is a vector space that can be spanned by a finite set of vectors.

  • We showed that any two bases for a finite dimensional vector space must contain the same number of vectors. Therefore, we can define the dimension of a finite dimensional vector space V to be the number of vectors in any basis for V.

  • If V is a vector space with dimension n and S is any linearly independent subset of V with n vectors, then S is a basis for V. Otherwise, we could add vectors to S to make a basis for V and then V would have a basis of more than n vectors.

  • If V is a vector space with dimension n and S is any subset of V with n vectors that spans V, then S is a basis for V. Otherwise, we could remove vectors from S to obtain a basis for V and then V would have a basis of fewer than n vectors.

  • For any finite dimensional space V and a subspace W of V, dim(W) ≤ dim(V).

Exercises

1.

Let W = Span{1+t2, 2+t+2t2+t3, 1+t+t3, t−t2+t3} in P3. Find a basis for W. What is the dimension of W?

2.

Let A = [1 2; 1 0] and B = [1 0; 1 −1].

(a)

Are A and B linearly independent or dependent? Verify your result.

(b)

Extend the set S = {A, B} to a basis for M2×2. That is, find a basis for M2×2 that contains both A and B.

3.

Let A = [1 2 0; 3 0 2], B = [−1 1 1; 2 4 0], C = [5 1 −3; 0 −12 4], D = [5 4 −2; 5 −8 6], and E = [2 0 0; −2 0 1] in M2×3 and let S = {A, B, C, D, E}.

(a)

Is S a basis for M2×3? Explain.

(b)

Determine if S is a linearly independent or dependent set. Verify your result.

(c)

Find a basis B for Span S that is a subset of S and write all of the vectors in S as linear combinations of the vectors in B.

(d)

Extend your basis B from part (c) to a basis of M2×3. Explain your method.

4.

Determine the dimension of each of the following vector spaces.

(b)

The space of all polynomials in P3 whose constant term is 0.

(d)

Span{[1 2 0 1 −1]T, [0 1 1 1 0]T, [0 3 2 3 0]T, [−1 0 1 1 1]T} in R5

5.

Let W be the set of matrices in M2×2 whose diagonal entries sum to 0. Show that W is a subspace of M2×2, find a basis for W, and then find dim(W).

6.

Show that if W is a subspace of a finite dimensional vector space V, then any basis of W can be extended to a basis of V.

7.

Let W be the set of all polynomials a + bt + ct2 in P2 such that a + b + c = 0. Show that W is a subspace of P2, find a basis for W, and then find dim(W).

8.

Suppose W1,W2 are subspaces in a finite-dimensional space V.

(a)

Show that it is not true in general that dim(W1) + dim(W2) ≤ dim(V).

(b)

Are there any conditions on W1 and W2 that will ensure that dim(W1) + dim(W2) ≤ dim(V)?

Hint.

See problem 12 in the previous section.

9.

Suppose W1 ⊆ W2 are two subspaces of a finite-dimensional space. Show that if dim(W1) = dim(W2), then W1 = W2.

Hint.

Consider dimensions.

10.

Suppose W1, W2 are both three-dimensional subspaces inside R4. In this exercise we will show that W1 ∩ W2 contains a plane. Let {u1,u2,u3} be a basis for W1 and let {v1,v2,v3} be a basis for W2.

(a)

If v1, v2, and v3 are all in W1, explain why W1 ∩ W2 must contain a plane.

(b)

Now we consider the case where not all of v1, v2, and v3 are in W1. Since the arguments will be the same, let us assume that v1 is not in W1.

(i)

Explain why the set S={u1,u2,u3,v1} is a basis for R4.

(ii)

Explain why v2 and v3 can be written as linear combinations of the vectors in S. Use these linear combinations to find two vectors that are in W1 ∩ W2. Then show that these vectors span a plane in W1 ∩ W2.

11.

A magic matrix is an n×n matrix in which the sum of the entries along any row, column, or diagonal is the same. For example, the 3×3 matrix

[ 5   4  0 ]
[ −2  3  8 ]
[ 6   2  1 ]

is a magic matrix. Note that the entries of a magic matrix can be any real numbers.

(a)

Show that the set of n×n magic matrices is a subspace of Mn×n.

Hint.

Consider the row, column, and diagonal sums.

(b)

Let V be the space of 3×3 magic matrices. Find a basis for V and determine the dimension of V.

12.

Label each of the following statements as True or False. Provide justification for your response.

(a) True/False.

The dimension of a finite dimensional vector space is the minimum number of vectors needed to span that space.

(b) True/False.

The dimension of a finite dimensional vector space is the maximum number of linearly independent vectors that can exist in that space.

(c) True/False.

If n vectors span an n-dimensional vector space V, then these vectors form a basis of V.

(d) True/False.

Any set of n vectors forms a basis in an n-dimensional vector space.

(e) True/False.

Every vector in a vector space V spans a one-dimensional subspace of V.

(f) True/False.

Any set of n linearly independent vectors in a vector space V of dimension n is a basis for V.

(g) True/False.

If {v1,v2,…,vk} is linearly independent in V, then dim(V) ≥ k.

(h) True/False.

If a set of k vectors spans V, then any set of more than k vectors in V is linearly dependent.

(i) True/False.

If an infinite set of vectors spans V, then V is infinite-dimensional.

(j) True/False.

If W1, W2 are both two-dimensional subspaces of a three dimensional vector space V, then W1 ∩ W2 ≠ {0}.

(k) True/False.

If dim(V) = n and W is a subspace of V with dimension n, then W = V.

Subsection Project: Understanding Principal Component Analysis

Suppose we were to film an experiment involving a ball that is bouncing up and down. Naively, we set up several cameras to follow the process of the experiment from different perspectives and collect the data. All of this data tells us something about the bouncing ball, but there may be no perspective that tells us an important piece of information: the axis along which the ball bounces. The question, then, is how we can extract from the data this important piece of information. Principal Component Analysis (PCA) is a tool for just this type of analysis.

Table 33.7. SAT data
State 1 2 3 4 5 6 7 8 9 10
EBRW 595 540 522 508 565 512 643 574 534 539
Math 571 536 493 493 554 501 655 566 534 539

We will use an example to illustrate important concepts we will need. To realistically apply PCA we will have much more data than this, but for now we will restrict ourselves to only two variables so that we can visualize our results. Table 33.7 presents information from ten states on two attributes related to the SAT: Evidence-Based Reading and Writing (EBRW) score and Math score. The SAT is made up of three sections: Reading, Writing and Language (also just called Writing), and Math. The EBRW score is calculated by combining the Reading and Writing section scores; both the Math and EBRW sections are scored on a scale of 200-800.

Each attribute (EBRW score, Math score) creates a vector whose entries are the values of that attribute for each state; here the data provides the average scores from participating students in each state. In this example we have two attribute vectors:

x1=[595 540 522 508 565 512 643 574 534 539]T and x2=[571 536 493 493 554 501 655 566 534 539]T.

These vectors form the rows of a 2×10 matrix

X0 = [x1T; x2T] =

[ 595  540  522  508  565  512  643  574  534  539 ]
[ 571  536  493  493  554  501  655  566  534  539 ]

that makes up our data set. A plot of the data is shown at left in Figure 33.8, where the EBRW score is along the horizontal axis and the math score is along the vertical axis.

Figure 33.8. Two views of the data set (EBRW horizontal, math vertical).

The question we want to answer is, how do we represent our data set so that the most important features in the data set are revealed?

Project Activity 33.6.

Before analyzing a data set there is often some initial preparation that needs to be made. Issues that have to be dealt with include the problem that attributes might have different units and scales. For example, in a data set with attributes about people, height could be measured in inches while weight is measured in pounds. It is difficult to compare variables when they are on different scales. Another issue to consider is that some attributes are independent of the others (height, for example, does not depend on hair color), while some are interrelated (body mass index depends on height and weight). To simplify our work, we will not address these types of problems. The only preparation we will do with our data is to center it.

(a)

An important piece of information about a one-dimensional data set x = [x1 x2 x3 ⋯ xn]T is the sample average or mean

x̄ = (1/n) ∑i=1n xi.

Calculate the means x̄1 and x̄2 for our SAT data from the matrix X0.

(b)

We can use these sample means to center our data at the origin by translating the data so that each column of our data matrix has mean 0. We do this by subtracting the mean for that row vector from each component of the vector. Determine the matrix X that contains the centered data for our SAT data set from matrix X0.

A plot of the centered data for our SAT data is shown at right in Figure 33.8. Later we will see why centering the data is useful; it will more easily allow us to project onto subspaces. The goal of PCA is to find a matrix P so that PX = Y, and Y is suitably transformed to identify the important aspects of the data. We will discuss what the important aspects are shortly. Before we do so, we need to discuss a way to compare the one-dimensional data vectors x1 and x2.
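A minimal numpy sketch of the centering step just described, assuming the raw SAT data is entered as the rows of X0 exactly as in Table 33.7; the row means are subtracted to produce the centered matrix X.

    import numpy as np

    # Raw SAT data (row 1: EBRW, row 2: Math; columns are the ten states).
    X0 = np.array([
        [595, 540, 522, 508, 565, 512, 643, 574, 534, 539],
        [571, 536, 493, 493, 554, 501, 655, 566, 534, 539],
    ], dtype=float)

    # Center each row at 0 by subtracting that row's mean from each entry.
    row_means = X0.mean(axis=1, keepdims=True)
    X = X0 - row_means

    print(row_means.ravel())   # the means of the EBRW and Math rows
    print(X)                   # centered data; each row now has mean 0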

Project Activity 33.7.

To compare the two one dimensional data vectors, we need to consider variance and covariance.

(a)

It is often useful to know how spread out a data set is, something the average doesn't tell us. For example, the data sets [1 2 3]T and [−2 0 8]T both have averages of 2, but the data in [−2 0 8]T is more spread out. Variance provides one measure of how spread out a one-dimensional data set x = [x1 x2 x3 ⋯ xn]T is. Variance is defined as

var(x) = (1/(n−1)) ∑i=1n (xi − x̄)2.

The variance provides a measure of how far from the average the data is spread. (See the note at the end of this section for why we divide by n−1 rather than n.) Determine the variances of the two data vectors x1 and x2. Which is more spread out?

(b)

In general, we will have more than one-dimensional data, as in our SAT data set. It will be helpful to have a way to compare one-dimensional data sets, to capture how much the data in two different data sets varies from the mean with respect to each other. One such measure is covariance, essentially the average of all corresponding products of deviations from the means. We define the covariance of two data vectors x = [x1 x2 ⋯ xn]T and y = [y1 y2 ⋯ yn]T as

cov(x,y) = (1/(n−1)) ∑i=1n (xi − x̄)(yi − ȳ).

Determine all covariances

cov(x1,x1), cov(x1,x2), cov(x2,x1),  and  cov(x2,x2).

How are cov(x1,x2) and cov(x2,x1) related? How are cov(x1,x1) and cov(x2,x2) related to variances?

(c)

What is most important about covariance is its sign. Suppose y = [y1 y2 … yn]T, z = [z1 z2 … zn]T and cov(y,z) > 0. Then if yi is larger than yj it is likely that zi is also larger than zj. For example, if y is a vector that records a person's height from age 2 to 10 and z records that same person's weight in the same years, we might expect that when yi increases so does zi. Similarly, if cov(y,z) < 0, then as one data set increases, the other decreases. As an example, if y records the number of hours a student spends playing video games each semester and z gives the student's GPA for each semester, then we might expect that zi decreases as yi increases. When cov(y,z) = 0, then y and z are said to be uncorrelated or independent of each other. For our example x1 and x2, what does cov(x1,x2) tell us about the relationship between x1 and x2? Why should we expect this from the context?

(d)

The covariance gives us information about the relationships between the attributes. So instead of working with the original data, we work with the covariance data. If we have m data vectors x1, x2, …, xm in Rn, the covariance matrix C = [cij] satisfies cij = cov(xi, xj). Calculate the covariance matrix for our SAT data. Then explain why C = (1/(n−1)) XXT.
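The sketch below (again an illustration, not part of the activity) computes the covariance matrix for the centered SAT data both entry by entry from the definition and all at once from the matrix formula C = (1/(n−1))XXT, and checks that the two computations agree.

    import numpy as np

    # Centered SAT data: EBRW and Math rows from Table 33.7 with row means removed.
    X0 = np.array([
        [595, 540, 522, 508, 565, 512, 643, 574, 534, 539],
        [571, 536, 493, 493, 554, 501, 655, 566, 534, 539],
    ], dtype=float)
    X = X0 - X0.mean(axis=1, keepdims=True)
    n = X.shape[1]                        # number of observations (states)

    # Covariance matrix from the matrix formula C = (1/(n-1)) X X^T ...
    C = (X @ X.T) / (n - 1)

    # ... and the same matrix entry by entry from the definition of covariance.
    C_entrywise = np.array([[np.dot(X[i], X[j]) / (n - 1) for j in range(2)]
                            for i in range(2)])

    assert np.allclose(C, C_entrywise)
    print(C.round(2))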

Recall that the goal of PCA is to find a matrix P such that PX=Y where P transforms the data set to a coordinate system in which the important aspects of the data are revealed. We are now in a position to discuss what that means.

An ideal view of our data would be one in which we can see the direction of greatest variance and one that minimizes redundancy. With redundant data the variables are not independent; that is, the covariance is nonzero. So we would like the covariances to all be zero (or as close to zero as possible) to remove redundancy in our data. That means that we would like a covariance matrix in which the non-diagonal entries are all zero. This will be possible if the covariance matrix is diagonalizable.

Project Activity 33.8.

Consider the covariance matrix C = [1760.18 1967.62; 1967.62 2319.29]. Explain why we can find a matrix P with determinant 1 whose columns are unit vectors that diagonalizes C. Then find such a matrix. Use technology as appropriate.
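One way to carry out this computation with technology is sketched below; numpy's symmetric eigensolver returns orthonormal eigenvectors, and flipping the sign of one eigenvector when necessary makes the determinant of the resulting matrix equal to 1. This is only one possible approach, not the required one.

    import numpy as np

    C = np.array([[1760.18, 1967.62],
                  [1967.62, 2319.29]])

    # eigh returns eigenvalues in increasing order and orthonormal eigenvectors
    # as columns; reverse both so the dominant eigenvalue comes first.
    vals, vecs = np.linalg.eigh(C)
    vals, vecs = vals[::-1], vecs[:, ::-1]

    # Flip the sign of the last column if needed so that det = 1 (a rotation).
    if np.linalg.det(vecs) < 0:
        vecs[:, -1] *= -1

    print(vals.round(2))                     # eigenvalues, largest first
    print(vecs.round(2))                     # columns are unit eigenvectors
    print(np.round(vecs.T @ C @ vecs, 2))    # a diagonal matrix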

For our purposes, we want to diagonalize XXT as PXXTP−1, so the matrix P that serves our purposes is the one whose rows are the eigenvectors of XXT. To understand why this matrix is the one we want, recall that we want to have PX = Y, and we want the covariance matrix (1/(n−1))YYT of the transformed data to be diagonal. In this situation we will have (recalling that P−1 = PT)

(1/(n−1)) YYT = (1/(n−1)) (PX)(PX)T = (1/(n−1)) P(XXT)PT = P((1/(n−1))XXT)P−1.

So the matrix P that we want is exactly the one that diagonalizes XXT.

Project Activity 33.9.

There are two useful ways we can interpret the results of our work so far. An eigenvector of XXT that corresponds to the largest (also called the dominant) eigenvalue λ1 is [0.66 0.76]T. A plot of the centered data along with the eigenspace Eλ1 of XXT spanned by v1 = [0.66 0.76]T is shown at left in Figure 33.9. The eigenvector v1 is called the first principal component of X. Notice that this line Eλ1 indicates the direction of greatest variation in the data. That is, the sum of the squares of the differences between the data points and the mean is as large as possible in this direction. When we project the data points onto Eλ1, as shown at right in Figure 33.9, the variation of the resulting points is larger than it is for any other line; in other words, the data is most spread out in this direction.

(a)

There is another way we can interpret this result. If we drop a perpendicular from one of our data points to the space EΞ»1 it creates a right triangle with sides of length a, b, and c as illustrated in the middle of Figure 33.9. Use this idea to explain why maximizing the variation also minimizes the sum of the squares of the distances from the data points to this line. As a result, we have projected our two-dimensional data onto the one-dimensional space that maximizes the variance of the data.

Figure 33.9. The principal component.
(b)

Recall that the matrix (to two decimal places) P = [−0.66 −0.76; 0.76 −0.66] transforms the data set X to a new data set Y = PX whose covariance matrix is diagonal. Explain how the x-axis is related to the transformed data set Y.

Figure 33.10. Applying P.

The result of Project Activity 33.9 is that we have reduced the problem from considering the data in a two-dimensional space to a one-dimensional space EΞ»1 where the most important aspect of the data is revealed. Of course, we eliminate some of the characteristics of the data, but the most important aspect is still included and highlighted. This is one of the important uses of PCA, data dimension reduction, which allows us to reveal key relationships between the variables that might not be evident when looking at a large dataset.

The second eigenvector of XXT also has meaning. A picture of the eigenspace Eλ2 corresponding to the smaller eigenvalue λ2 of XXT is shown in Figure 33.11. The second eigenvector of XXT is orthogonal to the first, and the direction of the second eigenvector tells us the direction of the second largest amount of variance, as can be seen in Figure 33.11.

Figure 33.11. The second principal component.

To summarize, the unit eigenvector for the largest eigenvalue of XXT indicates the direction in which the data has the greatest variance. The direction of the unit eigenvector for the smaller eigenvalue shows the direction in which the data has the second largest variance. This direction is also perpendicular to the first (indicating 0 covariance). The directions of the eigenvectors are called the principal components of X. The eigenvector with the highest eigenvalue is the first principal component of the data set and the other eigenvectors are ordered by the eigenvalues, highest to lowest. The principal components provide a new coordinate system from which to view our data, one in which we can see the maximum variance and in which there is zero covariance.

Project Activity 33.10.

We can use the eigenvalues of XXT to quantify the amount of variance that is accounted for by our projections. Notice that the points along the x-axis at right in Figure 33.10 are exactly the numbers in the first row of Y. These numbers provide the projections of the data in Y onto the x-axis, the axis along which the data has its greatest variance.

(a)

Calculate the variance of the data given by the first row of Y. This is the variance of the data in the direction of the eigenspace Eλ1. How does the result compare to the entries of the covariance matrix for Y?

(b)

Repeat part (a) for the data along the second row of Y.

(c)

The total variance of the data set is the sum of the variances. Explain why the amount of variance in the data that is accounted for in the direction of Eλ1 is

λ1/(λ1 + λ2).

Then calculate this amount for the SAT data.

In general, PCA is most useful for larger data sets. The process is the same; a short computational sketch follows the list below.

  • Start with a set of data that forms the rows of an m×n matrix. We center the data by subtracting the mean of each row from the entries of that row to create a centered data set in a matrix X.

  • The principal components of X are the eigenvectors of XXT, ordered so that they correspond to the eigenvalues of XXT in decreasing order.

  • Let P be the matrix whose rows are the principal components of X, ordered from highest to lowest. Then Y=PX is suitably transformed to identify the important aspects of the data.

  • If λ1, λ2, …, λn are the eigenvalues of XXT in decreasing order, then the amount of variance in the data accounted for by the first r principal components is given by

    (λ1 + λ2 + ⋯ + λr)/(λ1 + λ2 + ⋯ + λn).
  • The first r rows of Y=PX provide the projection of the data set X onto an r-dimensional space spanned by the first r principal components of X.
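The steps above translate almost directly into code. The sketch below is a minimal illustration of the general procedure for a data matrix whose rows are attributes; it is not a complete statistical treatment (no rescaling of variables, for instance), and the function name pca is simply a label chosen here.

    import numpy as np

    def pca(data):
        """Principal components of a data matrix whose rows are attributes.

        Returns the eigenvalues of X X^T in decreasing order, the matrix P
        whose rows are the principal components, and the transformed data Y = P X.
        """
        data = np.asarray(data, dtype=float)
        # Center each row (attribute) at mean 0.
        X = data - data.mean(axis=1, keepdims=True)
        # Eigen-decomposition of the symmetric matrix X X^T.
        vals, vecs = np.linalg.eigh(X @ X.T)
        order = np.argsort(vals)[::-1]        # largest eigenvalue first
        vals, vecs = vals[order], vecs[:, order]
        P = vecs.T                            # rows of P are the principal components
        return vals, P, P @ X

    # Example: the two-attribute SAT data from Table 33.7.
    data = [[595, 540, 522, 508, 565, 512, 643, 574, 534, 539],
            [571, 536, 493, 493, 554, 501, 655, 566, 534, 539]]
    vals, P, Y = pca(data)
    print(vals / vals.sum())   # fraction of the total variance in each component

Dividing each eigenvalue by the sum of all the eigenvalues gives the fraction of the total variance accounted for by the corresponding principal component, matching the variance formula in the list above.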

Project Activity 33.11.

Let us now consider a problem with more than two variables. We continue to keep the data set small so that we can conveniently operate with it. Table 33.12 presents additional information from ten states on four attributes related to the SAT: participation rate, Evidence-Based Reading and Writing (EBRW) score, Math score, and average SAT score. Use technology as appropriate for this activity.

Table 33.12. SAT data
State 1 2 3 4 5 6 7 8 9 10
Rate 6 60 97 100 64 99 4 23 79 70
EBRW 595 540 522 508 565 512 643 574 534 539
Math 571 536 493 493 554 501 655 566 534 539
SAT 1166 1076 1014 1001 1120 1013 1296 1140 1068 1086
(a)

Determine the centered data matrix X for this data set.

(b)

Find the covariance matrix for this data set. Round to four decimal places.

(c)

Find the principal components of X. Include at least four decimal places accuracy.

(d)

How much variation is accounted for in the data by the first principal component? In other words, if we reduce this data to one dimension, how much of the variation do we retain? Explain.

(e)

How much variation is accounted for in the data by the first two principal components? In other words, if we reduce this data to two dimensions, how much of the variation do we retain? Explain.

We conclude with a comment. A reasonable question to ask is how we interpret the principal components. Let P be an orthogonal matrix such that PCPT is the diagonal matrix with the eigenvalues of C along the diagonal, in decreasing order. We then have the new perspective Y = PX from which to view the data. The first principal component p1 (the first row of P) determines the new variable y1 = [yi] (the first row of Y) in the following manner. Let p1 = [pi]T and let ci represent the columns of X so that X = [c1 c2 c3 ⋯ c10]. Recognizing that

PX = [p1T; p2T; p3T; p4T][c1 c2 c3 ⋯ c10],

we have that

yi=p1Tci.

That is,

y1 = [p1Tc1 p1Tc2 p1Tc3 ⋯ p1Tc10].

So each yi is a linear combination of the original variables (contained in the ci) with weights from the first principal component. The other new variables are obtained in the same way from the remaining principal components. So even though the principal components may not have an easy interpretation in context, they are connected to the original data in this way. By reducing the data to a few important principal components, that is, by visualizing the data in a subspace of small dimension, we can account for almost all of the variation in the data and relate that information back to the original data.
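A tiny numpy check of this observation, using randomly generated stand-ins for P and X (of compatible sizes) rather than the actual data: the first row of Y = PX is exactly the vector of products p1Tci taken over the columns ci of X.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((4, 10))     # stand-in for a centered 4 x 10 data matrix
    P = rng.standard_normal((4, 4))      # stand-in for the matrix of principal components

    Y = P @ X
    p1 = P[0]                            # first row of P (the first principal component)
    y1 = np.array([p1 @ X[:, i] for i in range(X.shape[1])])

    assert np.allclose(Y[0], y1)         # first row of Y = p1 dotted with each column of X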

A note on the variance formula: it might seem that we should divide by n instead of n−1 in the variance, but it is generally accepted to do this for reasons we won't get into. Suffice it to say that if we are using a sample of the entire population, then dividing by n−1 provides a variance whose square root is closer to the standard deviation than we would get if we divide by n. If we are calculating the variance of an entire population, then we would divide by n.