By the end of this section, you should be able to give precise and thorough answers to the questions listed below. You may want to keep these questions in mind to focus your thoughts as you complete the section.
What is the operator norm of a matrix and what does it tell us about the matrix?
What is a singular value decomposition of a matrix? Why is a singular value decomposition important?
How does a singular value decomposition relate fundamental subspaces connected to a matrix?
What is an outer product decomposition of a matrix and how is it useful?
Effective search engines search for more than just words. Language is complex and search engines must deal with the fact that there are often many ways to express a given concept (this is called synonymy: multiple words can have the same meaning), and that a single word can have multiple meanings (polysemy). As a consequence, a search on a word may provide irrelevant matches (e.g., searching for derivative could return pages on mathematics or on financial securities), or you might search for articles on cats when the paper you really want uses the word felines. A better search engine will not necessarily try to match terms, but instead retrieve information based on concept or intent. Latent Semantic Indexing (LSI) (also called Latent Semantic Analysis), developed in the late 1980s, helps search engines determine concept and intent in order to provide more accurate and relevant results. LSI essentially works by providing underlying (latent) relationships between words (semantics) that search engines need to provide context and understanding (indexing). LSI provides a mapping of both words and documents into a lower dimensional “concept” space, and performs the search in this new space. The mapping is provided by the singular value decomposition.
The singular value decomposition (SVD) of a matrix is an important and useful matrix decomposition. Unlike other matrix decompositions, every matrix has a singular value decomposition. The SVD is used in a variety of applications including scientific computing, digital signal processing, image compression, principal component analysis, web searching through latent semantic indexing, and seismology. Recall that the eigenvector decomposition of an $n \times n$ diagonalizable matrix $A$ has the form $A = PDP^{-1}$, where the columns of the matrix $P$ are $n$ linearly independent eigenvectors of $A$ and the diagonal entries of the diagonal matrix $D$ are the eigenvalues of $A$. The singular value decomposition does something similar for any matrix of any size. One of the keys to the SVD is that the matrix $A^TA$ is symmetric for any matrix $A$.
Before we introduce the Singular Value Decomposition, let us work through some preliminaries to motivate the idea. The first is to provide an answer to the question “How ‘big’ is a matrix?” There are many ways to interpret and answer this question, but a substantial (and useful) answer should involve more than just the dimensions of the matrix. A good measure of the size of a matrix, which we will refer to as the norm of the matrix, should take into account the action of the linear transformation defined by the matrix on vectors. This then will lead to questions about how difficult or easy it is to solve a matrix equation $A\mathbf{x} = \mathbf{b}$.
If we want to incorporate the action of a matrix $A$ into a calculation of the norm of $A$, we might think of measuring how much $A$ can change a vector $\mathbf{x}$. This could lead us to using $||A\mathbf{x}||$ as some sort of measure of a norm of $A$. However, since $||A(c\mathbf{x})|| = |c| \, ||A\mathbf{x}||$ for any scalar $c$, scaling a vector by a large scalar will produce a large norm, so this is not a viable definition of a norm. We could instead measure the relative effect that $A$ has on a vector $\mathbf{x}$ as $\frac{||A\mathbf{x}||}{||\mathbf{x}||}$, since this ratio does not change when $\mathbf{x}$ is multiplied by a nonzero scalar. The largest of all of these ratios would provide a good sense of how much $A$ can change vectors. Thus, we define the operator norm of a matrix $A$ as
$$||A|| = \max_{\mathbf{x} \neq \mathbf{0}} \frac{||A\mathbf{x}||}{||\mathbf{x}||}.$$
Due to the linearity of matrix multiplication, we can restrict ourselves to unit vectors for an equivalent definition of the operator norm of the matrix $A$ as
$$||A|| = \max_{||\mathbf{x}|| = 1} ||A\mathbf{x}||.$$
The operator norm of a matrix tells us that the size of the action of an $m \times n$ matrix $A$ can be determined by its action on the unit sphere in $\mathbb{R}^n$ (the unit sphere is the set of terminal points of unit vectors). Let us consider two examples.
Consider a $2 \times 2$ matrix $A$. We can draw a graph to see the action of $A$ on the unit circle. A picture of the set $\{A\mathbf{x} : ||\mathbf{x}|| = 1\}$ is shown in Figure 29.3.
Figure 29.3. The image of the unit circle under the action of $A$.
It appears that $A$ transforms the unit circle into an ellipse. To find $||A||$ we want to maximize $||A\mathbf{x}||$ for $\mathbf{x}$ on the unit circle. This is the same as maximizing
$$||A\mathbf{x}||^2 = (A\mathbf{x}) \cdot (A\mathbf{x}) = \mathbf{x}^T A^T A \mathbf{x}.$$
Now $A^TA$ is a symmetric matrix, so we can orthogonally diagonalize $A^TA$. The eigenvalues of $A^TA$ are 32 and 2. Let $P = [\mathbf{v}_1 \ \mathbf{v}_2]$, where $\mathbf{v}_1$ is a unit eigenvector of $A^TA$ with eigenvalue 32 and $\mathbf{v}_2$ is a unit eigenvector of $A^TA$ with eigenvalue 2. Then $P$ is an orthogonal matrix such that $P^T(A^TA)P = D = \begin{bmatrix} 32 & 0 \\ 0 & 2 \end{bmatrix}$. It follows that
$$\mathbf{x}^T A^T A \mathbf{x} = \mathbf{x}^T P D P^T \mathbf{x} = (P^T\mathbf{x})^T D (P^T\mathbf{x}).$$
Now $P$ is orthogonal, so $||P\mathbf{x}|| = ||\mathbf{x}||$ and $P$ maps the unit circle to the unit circle. Moreover, if $\mathbf{x}$ is on the unit circle, then $\mathbf{y} = P^T\mathbf{x}$ is also on the unit circle and $\mathbf{x} = P\mathbf{y}$. So every point $\mathbf{x}$ on the unit circle corresponds to a point $\mathbf{y}$ on the unit circle. Thus, the forms $\mathbf{x}^T A^T A \mathbf{x}$ and $\mathbf{y}^T D \mathbf{y}$ take on exactly the same values over all points on the unit circle. Now we just need to find the maximum value of $\mathbf{y}^T D \mathbf{y}$. This turns out to be relatively easy since $D$ is a diagonal matrix.
Let's simplify the notation. Let $\mathbf{y} = P^T\mathbf{x}$. Then our job is to maximize $\mathbf{y}^T D \mathbf{y}$. If $\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}$, then
$$\mathbf{y}^T D \mathbf{y} = 32y_1^2 + 2y_2^2.$$
We want to find the maximum value of this expression for $\mathbf{y}$ on the unit circle. Note that $y_1^2 + y_2^2 = 1$ and so
$$32y_1^2 + 2y_2^2 \le 32y_1^2 + 32y_2^2 = 32.$$
Since $\begin{bmatrix} 1 \\ 0 \end{bmatrix}$ is on the unit circle, the expression $32y_1^2 + 2y_2^2$ attains the value 32 at some point on the unit circle, so 32 is the maximum value of $\mathbf{y}^T D \mathbf{y}$ over all $\mathbf{y}$ on the unit circle. While we are at it, we can similarly find the minimum value of $\mathbf{y}^T D \mathbf{y}$ for $\mathbf{y}$ on the unit circle. Since $y_1^2 + y_2^2 = 1$ we see that
$$32y_1^2 + 2y_2^2 \ge 2y_1^2 + 2y_2^2 = 2.$$
Since the expression attains the value 2 at $\begin{bmatrix} 0 \\ 1 \end{bmatrix}$ on the unit circle, we can see that $\mathbf{y}^T D \mathbf{y}$ attains the minimum value of 2 on the unit circle.
Now we can return to the expression $\mathbf{x}^T A^T A \mathbf{x} = ||A\mathbf{x}||^2$. Since $\mathbf{x}^T A^T A \mathbf{x}$ assumes the same values as $\mathbf{y}^T D \mathbf{y}$, we can say that the maximum value of $||A\mathbf{x}||^2$ for $\mathbf{x}$ on the unit circle is 32 (and the minimum value is 2). Moreover, the quadratic form $\mathbf{y}^T D \mathbf{y}$ assumes its maximum value when $\mathbf{y} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$ or $\begin{bmatrix} -1 \\ 0 \end{bmatrix}$. Thus, the form $\mathbf{x}^T A^T A \mathbf{x}$ assumes its maximum value at the vector $\mathbf{x} = P\begin{bmatrix} 1 \\ 0 \end{bmatrix} = \mathbf{v}_1$ or $-\mathbf{v}_1$. Similarly, the quadratic form attains its minimum value at $\mathbf{x} = \mathbf{v}_2$ or $-\mathbf{v}_2$. We conclude that $||A|| = \sqrt{32} = 4\sqrt{2}$.
Figure 29.4 shows the image of the unit circle under the action of $A$ and the images of $\mathbf{v}_1$ and $\mathbf{v}_2$, where $\mathbf{v}_1$ and $\mathbf{v}_2$ are the two unit eigenvectors of $A^TA$. The image also supports that $||A\mathbf{x}||$ assumes its maximum and minimum values for points on the unit circle at $\mathbf{v}_1$ and $\mathbf{v}_2$.
Figure 29.4. The image of the unit circle under the action of $A$, and the vectors $A\mathbf{v}_1$ and $A\mathbf{v}_2$.
What we have just argued is that the maximum value of $||A\mathbf{x}||$ for $\mathbf{x}$ on the unit sphere is the square root of the largest eigenvalue of $A^TA$ and occurs at a corresponding unit eigenvector.
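As a quick numerical sanity check of this fact, the short Python sketch below compares the square root of the largest eigenvalue of $A^TA$ with the operator norm computed directly; the matrix used here is an arbitrary stand-in, not the example matrix from the text.

```python
import numpy as np

# An arbitrary stand-in matrix (any matrix works for this check).
A = np.array([[3.0, 4.0],
              [1.0, 2.0]])

# Eigenvalues of the symmetric matrix A^T A.
eigvals = np.linalg.eigvalsh(A.T @ A)

# The operator norm is the square root of the largest eigenvalue of A^T A,
# which equals the largest singular value that np.linalg.norm(A, 2) returns.
print(np.sqrt(eigvals.max()))
print(np.linalg.norm(A, 2))
```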
This same process works for matrices other than $2 \times 2$ ones. For example, consider a $2 \times 3$ matrix $A$. In this case $A$ maps $\mathbb{R}^3$ to $\mathbb{R}^2$. The image of the unit sphere under left multiplication by $A$ is a filled ellipse as shown in Figure 29.6.
Figure 29.6. The image of the unit sphere under the action of $A$, and the vectors $A\mathbf{v}_1$ and $A\mathbf{v}_2$.
As with the previous example, the norm of $A$ is the square root of the maximum value of $\mathbf{x}^T A^T A \mathbf{x}$, and this maximum value is the dominant eigenvalue of $A^TA$. The eigenvalues of $A^TA$ are $\lambda_1$, $\lambda_2$, and $\lambda_3 = 0$ with corresponding unit eigenvectors $\mathbf{v}_1$, $\mathbf{v}_2$, and $\mathbf{v}_3$. So in this case we have $||A|| = \sqrt{\lambda_1}$. The transformation defined by matrix multiplication by $A$ from $\mathbb{R}^3$ to $\mathbb{R}^2$ has a one-dimensional kernel which is spanned by the eigenvector $\mathbf{v}_3$ corresponding to the eigenvalue 0. The image of the transformation is 2-dimensional, and the image of the unit sphere is a filled ellipse where $A\mathbf{v}_1$ gives the major axis of the ellipse and $A\mathbf{v}_2$ gives the minor axis. Essentially, the square roots of the eigenvalues of $A^TA$ tell us how $A$ stretches the image space in each direction.
We have just argued that the image of the unit sphere under the action of an $m \times n$ matrix $A$ is an ellipsoid in $\mathbb{R}^m$ stretched the greatest amount, $\sqrt{\lambda_1}$, in the direction of an eigenvector for the largest eigenvalue ($\lambda_1$) of $A^TA$; the next greatest amount, $\sqrt{\lambda_2}$, in the direction of a unit eigenvector for the second largest eigenvalue ($\lambda_2$) of $A^TA$; and so on.
The Singular Value Decomposition (SVD) is essentially a concise statement of what we saw in the previous section that works for any matrix. We will uncover the SVD in this section.
Preview Activity 29.3 contains the basic ideas behind the Singular Value Decomposition. Let $A$ be an $m \times n$ matrix with real entries. Note that $A^TA$ is a symmetric $n \times n$ matrix and, hence, it can be orthogonally diagonalized. Let $V = [\mathbf{v}_1 \ \mathbf{v}_2 \ \cdots \ \mathbf{v}_n]$ be an orthogonal matrix whose columns form an orthonormal set of eigenvectors for $A^TA$. For each $i$, let $(A^TA)\mathbf{v}_i = \lambda_i \mathbf{v}_i$. We know
$$(A\mathbf{v}_i) \cdot (A\mathbf{v}_j) = \mathbf{v}_i^T A^T A \mathbf{v}_j = \lambda_j \, (\mathbf{v}_i \cdot \mathbf{v}_j) = 0$$
if $i \neq j$. So the set $\{A\mathbf{v}_1, A\mathbf{v}_2, \ldots, A\mathbf{v}_n\}$ is an orthogonal set in $\mathbb{R}^m$. Each of the vectors $A\mathbf{v}_i$ is in Col $A$, and so $\{A\mathbf{v}_1, A\mathbf{v}_2, \ldots, A\mathbf{v}_n\}$ is an orthogonal subset of Col $A$. It is possible that $A\mathbf{v}_i = \mathbf{0}$ for some of the $\mathbf{v}_i$ (if $A^TA$ has 0 as an eigenvalue). Let $\mathbf{v}_1$, $\mathbf{v}_2$, $\ldots$, $\mathbf{v}_r$ be the eigenvectors corresponding to the nonzero eigenvalues. Then the set
$$\mathcal{B} = \{A\mathbf{v}_1, A\mathbf{v}_2, \ldots, A\mathbf{v}_r\}$$
is a linearly independent set of nonzero orthogonal vectors in Col $A$. Now we will show that $\mathcal{B}$ is a basis for Col $A$. Let $\mathbf{y}$ be a vector in Col $A$. Then $\mathbf{y} = A\mathbf{x}$ for some vector $\mathbf{x}$ in $\mathbb{R}^n$. Recall that the vectors $\mathbf{v}_1$, $\mathbf{v}_2$, $\ldots$, $\mathbf{v}_n$ form an orthonormal basis of $\mathbb{R}^n$, so we can write $\mathbf{x} = c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \cdots + c_n\mathbf{v}_n$ for some scalars $c_1$, $c_2$, $\ldots$, $c_n$. Since $A\mathbf{v}_i = \mathbf{0}$ for $i > r$, it follows that $\mathbf{y} = A\mathbf{x} = c_1 A\mathbf{v}_1 + c_2 A\mathbf{v}_2 + \cdots + c_r A\mathbf{v}_r$, so $\mathcal{B}$ spans Col $A$ and is therefore a basis for Col $A$.
Let $A$ be an $m \times n$ matrix of rank $r$. There exist an $m \times m$ orthogonal matrix $U$, an $n \times n$ orthogonal matrix $V$, and an $m \times n$ matrix $\Sigma$ whose first $r$ diagonal entries are the singular values $\sigma_1$, $\sigma_2$, $\ldots$, $\sigma_r$ and whose other entries are 0, such that
$$A = U \Sigma V^T.$$
A Singular Value Decomposition of an $m \times n$ matrix $A$ of rank $r$ can be found as follows.
Find an orthonormal basis $\{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n\}$ of eigenvectors of $A^TA$ such that $(A^TA)\mathbf{v}_i = \lambda_i \mathbf{v}_i$ for $i$ from 1 to $n$ with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$, with the first $r$ eigenvalues being positive. The vectors $\mathbf{v}_1$, $\mathbf{v}_2$, $\ldots$, $\mathbf{v}_n$ are the right singular vectors of $A$.
Let
$$V = [\mathbf{v}_1 \ \mathbf{v}_2 \ \cdots \ \mathbf{v}_n].$$
Then $V$ orthogonally diagonalizes $A^TA$.
The singular values of $A$ are the numbers $\sigma_i$, where $\sigma_i = \sqrt{\lambda_i}$ for $i$ from 1 to $n$. Let $\Sigma$ be the $m \times n$ matrix whose first $r$ diagonal entries are $\sigma_1$, $\sigma_2$, $\ldots$, $\sigma_r$ and whose other entries are 0.
For $i$ from 1 to $r$, let $\mathbf{u}_i = \frac{1}{\sigma_i} A\mathbf{v}_i$. Then the set $\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_r\}$ forms an orthonormal basis of Col $A$.
Extend the set $\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_r\}$ to an orthonormal basis
$$\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_m\}$$
of $\mathbb{R}^m$. Let
$$U = [\mathbf{u}_1 \ \mathbf{u}_2 \ \cdots \ \mathbf{u}_m].$$
The vectors $\mathbf{u}_1$, $\mathbf{u}_2$, $\ldots$, $\mathbf{u}_m$ are the left singular vectors of $A$, and $A = U\Sigma V^T$ is a singular value decomposition of $A$.
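To make the construction concrete, here is a short Python sketch that carries out these steps numerically for a small stand-in matrix: it orthogonally diagonalizes $A^TA$ to get $V$ and the singular values, forms $\Sigma$, builds $\mathbf{u}_i = \frac{1}{\sigma_i}A\mathbf{v}_i$ for $i \le r$, and extends to an orthonormal basis of $\mathbb{R}^m$. (In practice one would simply call a library routine such as `np.linalg.svd`.)

```python
import numpy as np

A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])            # stand-in 2x3 matrix of rank 2
m, n = A.shape

# Step 1: orthonormal eigenvectors of A^T A, eigenvalues in decreasing order.
lams, V = np.linalg.eigh(A.T @ A)          # eigh returns increasing order
order = np.argsort(lams)[::-1]
lams, V = lams[order], V[:, order]

# Step 2: singular values and the m x n matrix Sigma.
sigmas = np.sqrt(np.clip(lams, 0, None))
r = int(np.sum(sigmas > 1e-12))            # rank of A
Sigma = np.zeros((m, n))
np.fill_diagonal(Sigma, sigmas[:min(m, n)])

# Step 3: left singular vectors u_i = (1/sigma_i) A v_i for i = 1, ..., r.
U = A @ V[:, :r] / sigmas[:r]

# Step 4: extend {u_1, ..., u_r} to an orthonormal basis of R^m by
# Gram-Schmidt against the standard basis vectors.
for e in np.eye(m):
    w = e - U @ (U.T @ e)
    if np.linalg.norm(w) > 1e-10:
        U = np.column_stack([U, w / np.linalg.norm(w)])

print(np.allclose(A, U @ Sigma @ V.T))     # True: A = U Sigma V^T
```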
The decomposition
$$A = \sigma_1 \mathbf{u}_1\mathbf{v}_1^T + \sigma_2 \mathbf{u}_2\mathbf{v}_2^T + \cdots + \sigma_r \mathbf{u}_r\mathbf{v}_r^T$$
is called an outer product decomposition of $A$ and tells us everything we learned above about the action of the matrix $A$ as a linear transformation. Each of the products $\mathbf{u}_i\mathbf{v}_i^T$ is a rank 1 matrix (see Exercise 9), and $\sigma_1$ is the largest value $||A\mathbf{x}||$ takes on the unit sphere, $\sigma_2$ is the next largest dilation of the unit sphere, and so on. An outer product decomposition allows us to approximate $A$ with smaller rank matrices. For example, the matrix $\sigma_1\mathbf{u}_1\mathbf{v}_1^T$ is the best rank 1 approximation to $A$, $\sigma_1\mathbf{u}_1\mathbf{v}_1^T + \sigma_2\mathbf{u}_2\mathbf{v}_2^T$ is the best rank 2 approximation, and so on. This will be very useful in applications, as we will see in the next section.
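As an illustration, the sketch below (with an arbitrary stand-in matrix) builds the rank 1 and rank 2 approximations by truncating the outer product sum and checks their ranks; `np.linalg.svd` supplies the singular values and singular vectors.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0],
              [1.0, 0.0, 1.0]])            # an arbitrary stand-in matrix

U, s, Vt = np.linalg.svd(A)

def rank_k_approx(k):
    """Sum of the first k outer product terms sigma_i * u_i * v_i^T."""
    return sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(k))

A1 = rank_k_approx(1)                       # best rank 1 approximation
A2 = rank_k_approx(2)                       # best rank 2 approximation
print(np.linalg.matrix_rank(A1), np.linalg.matrix_rank(A2))
print(np.linalg.norm(A - A1, 2), np.linalg.norm(A - A2, 2))   # errors shrink as k grows
```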
We conclude this section with a short discussion of how a singular value decomposition relates fundamental subspaces of a matrix. We have seen that the vectors $\mathbf{u}_1$, $\mathbf{u}_2$, $\ldots$, $\mathbf{u}_r$ in an SVD for an $m \times n$ matrix $A$ of rank $r$ form a basis for Col $A$. Recall also that $A\mathbf{v}_i = \mathbf{0}$ for $r+1 \le i \le n$. Since $\dim(\text{Nul } A) + \dim(\text{Col } A) = n$, it follows that the vectors $\mathbf{v}_{r+1}$, $\mathbf{v}_{r+2}$, $\ldots$, $\mathbf{v}_n$ form a basis for Nul $A$. As you will show in the exercises, the set $\{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_r\}$ is a basis for Row $A$. Thus, an SVD for a matrix $A$ tells us about three fundamental vector spaces related to $A$.
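The following minimal sketch reads these three bases directly off a full SVD computed with `np.linalg.svd`; the matrix is again a stand-in.

```python
import numpy as np

A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])             # stand-in matrix of rank 2

U, s, Vt = np.linalg.svd(A)                  # full SVD
r = int(np.sum(s > 1e-10))                   # rank of A

col_basis  = U[:, :r]        # u_1, ..., u_r:      orthonormal basis of Col A
row_basis  = Vt[:r, :].T     # v_1, ..., v_r:      orthonormal basis of Row A
null_basis = Vt[r:, :].T     # v_{r+1}, ..., v_n:  orthonormal basis of Nul A

print(np.allclose(A @ null_basis, 0))        # True: these vectors lie in Nul A
```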
Find a singular value decomposition for . You may use technology to find eigenvalues and eigenvectors of matrices.
Solution.
With as given, we have . Technology shows that the eigenvalues of are ,,, and with corresponding orthonormal eigenvectors ,,, and . This makes . The singular values of are ,,, and , so is the matrix with the nonzero singular values along the diagonal and zeros everywhere else. Finally, we define the vectors as . Again, technology gives us ,, and . Thus, a singular value decomposition of is , where
Use the singular value decomposition to find a basis for Col $A$, Row $A$, and Nul $A$.
Solution.
Recall that the right singular vectors of an matrix of rank form an orthonormal basis of eigenvectors of such that for from 1 to with . These vectors are the columns of the matrix in a singular value decomposition of . For from 1 to , we let . Then the set forms an orthonormal basis of Col . We extend this set to an orthonormal basis of . Recall also that for . Since Nul Col , it follows that the vectors ,,, form a basis for Nul . Finally, the set is a basis for Row . So in our example, we have ,,,,, and . Since the singular values of are ,,, and , it follows that rank. We also have ,, and . So
Find orthogonal matrices $U$ and $V$, and the matrix $\Sigma$, so that $U\Sigma V^T$ is a singular value decomposition of $A$.
Solution.
Normalizing the eigenvectors ,, and to normal eigenvectors ,, and , respectively, gives us an orthogonal matrix
.
Now , so normalizing the vectors and gives us vectors
and
that are the first two columns of our matrix . Given that is a matrix, we need to find two other vectors orthogonal to and that will combine with and to form an orthogonal basis for . Letting ,,, and , a computer algebra system shows that the reduced row echelon form of the matrix is , so that vectors ,,, are linearly independent. Letting and , the Gram-Schmidt process shows that the set is an orthogonal basis for , where
and (using for )
.
The set where ,, and is an orthonormal basis for and we can let
.
The singular values of are and , and so
.
Therefore, a singular value decomposition of is of
Determine the best rank 1 approximation to . The outer product decomposition of is
.
So the rank one approximation to is
.
Note that the rows in this rank one approximation are the averages of the two distinct rows in the matrix , which makes sense considering that this is the closest rank one matrix to .
We learned about the singular value decomposition of a matrix.
The operator norm of an $m \times n$ matrix $A$ is
$$||A|| = \max_{||\mathbf{x}|| = 1} ||A\mathbf{x}||.$$
The operator norm of a matrix tells us that the size of the action of an $m \times n$ matrix $A$ can be determined by its action on the unit sphere in $\mathbb{R}^n$.
A singular value decomposition of an $m \times n$ matrix $A$ of rank $r$ is of the form $A = U\Sigma V^T$, where
$V = [\mathbf{v}_1 \ \mathbf{v}_2 \ \cdots \ \mathbf{v}_n]$, where $\{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n\}$ is an orthonormal basis of eigenvectors of $A^TA$ such that $(A^TA)\mathbf{v}_i = \lambda_i\mathbf{v}_i$ for $i$ from 1 to $n$ with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$,
$U = [\mathbf{u}_1 \ \mathbf{u}_2 \ \cdots \ \mathbf{u}_m]$, where $\mathbf{u}_i = \frac{1}{\sigma_i}A\mathbf{v}_i$ for $i$ from 1 to $r$, and this orthonormal basis of Col $A$ is extended to an orthonormal basis $\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_m\}$ of $\mathbb{R}^m$,
$\Sigma$ is the $m \times n$ matrix whose first $r$ diagonal entries are the singular values $\sigma_i = \sqrt{\lambda_i}$ for $i$ from 1 to $r$ and whose other entries are 0.
A singular value decomposition is important in that every matrix has a singular value decomposition, and a singular value decomposition has a variety of applications including scientific computing and digital signal processing, image compression, principal component analysis, web searching through latent semantic indexing, and seismology.
The vectors $\mathbf{u}_1$, $\mathbf{u}_2$, $\ldots$, $\mathbf{u}_r$ in an SVD for an $m \times n$ matrix $A$ of rank $r$ form a basis for Col $A$, while the vectors $\mathbf{v}_{r+1}$, $\mathbf{v}_{r+2}$, $\ldots$, $\mathbf{v}_n$ form a basis for Nul $A$. Also, the set $\{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_r\}$ is a basis for Row $A$.
Let $A$ have an SVD as in the second bullet. Decomposing $A$ as
$$A = \sigma_1\mathbf{u}_1\mathbf{v}_1^T + \sigma_2\mathbf{u}_2\mathbf{v}_2^T + \cdots + \sigma_r\mathbf{u}_r\mathbf{v}_r^T$$
is an outer product decomposition of $A$. An outer product decomposition allows us to approximate $A$ with smaller rank matrices. For example, the matrix $\sigma_1\mathbf{u}_1\mathbf{v}_1^T$ is the best rank 1 approximation to $A$, $\sigma_1\mathbf{u}_1\mathbf{v}_1^T + \sigma_2\mathbf{u}_2\mathbf{v}_2^T$ is the best rank 2 approximation, and so on.
Let $A$ be an $m \times n$ matrix of rank $r$ with singular value decomposition $U\Sigma V^T$, where $U = [\mathbf{u}_1 \ \mathbf{u}_2 \ \cdots \ \mathbf{u}_m]$ and $V = [\mathbf{v}_1 \ \mathbf{v}_2 \ \cdots \ \mathbf{v}_n]$. We have seen that the set $\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_r\}$ is a basis for Col $A$, and the vectors $\mathbf{v}_{r+1}$, $\mathbf{v}_{r+2}$, $\ldots$, $\mathbf{v}_n$ form a basis for Nul $A$. In this exercise we examine the set $\{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_r\}$ and determine what this set tells us about Row $A$.
The vectors $\mathbf{v}_i$ that form the columns of the matrix $V$ in a singular value decomposition of a matrix $A$ are eigenvectors of $A^TA$. In this exercise we investigate the vectors $\mathbf{u}_i$ that make up the columns of the matrix $U$ in a singular value decomposition of a matrix $A$ for each $i$ between 1 and the rank of $A$, and their connection to the matrix $AA^T$.
Now we examine the result of part (a) in general. Let $A$ be an arbitrary $m \times n$ matrix. Calculate $AA^T\mathbf{u}_i$ for $i$ from 1 to rank($A$) and determine specifically how $AA^T\mathbf{u}_i$ is related to $\mathbf{u}_i$. What does this tell us about the vectors $\mathbf{u}_i$ and the matrix $AA^T$?
As an elementary example to illustrate the idea behind Latent Semantic Indexing (LSI), consider the problem of creating a program to search a collection of documents for words, or words related to a given word. Document collections are usually very large, but we use a small example for illustrative purposes. A standard example that is given in several publications is the following. Suppose we have nine documents $c_1$ through $c_5$ (titles of documents about human-computer interaction) and $m_1$ through $m_4$ (titles of graph theory papers) that make up our library:
$c_1$: Human machine interface for ABC computer applications
$c_2$: A survey of user opinion of computer system response time
$c_3$: The EPS user interface management system
$c_4$: System and human system engineering testing of EPS
$c_5$: Relation of user perceived response time to error measurement
$m_1$: The generation of random, binary, ordered trees
$m_2$: The intersection graph of paths in trees
$m_3$: Graph minors IV: Widths of trees and well-quasi-ordering
$m_4$: Graph minors: A survey
To make a searchable database, one might start by creating a list of key terms that appear in the documents (generally removing common words such as “a”, “the”, “of”, etc., called stop words; these words contribute little, if any, context). In our documents we identify the key words that are shown in italics. (Note that we are just selecting key words to make our example manageable, not necessarily identifying the most important words.) Using the key words we create a term-document matrix. The term-document matrix is the matrix in which the terms form the rows and the documents form the columns. If $A$ is the term-document matrix, then $a_{ij}$ counts the number of times word $i$ appears in document $j$. The term-document matrix for our library is shown in Figure 29.11.
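In code, building such a matrix is a simple counting exercise. The sketch below is a minimal illustration using a few of the titles and key terms above (not the full library), so the resulting matrix is smaller than the one in Figure 29.11.

```python
import numpy as np

# A few key terms and document titles from the example library (not all of them).
terms = ["human", "interface", "computer", "user", "system", "trees", "graph"]
docs = [
    "human machine interface for abc computer applications",          # c1
    "a survey of user opinion of computer system response time",      # c2
    "the intersection graph of paths in trees",                       # m2
]

# Term-document matrix: entry (i, j) counts how many times term i appears in document j.
A = np.array([[doc.split().count(term) for doc in docs] for term in terms])
print(A)
```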
One of our goals is to rate the pages in our library for relevance if we search for a query. For example, suppose we want to rate the pages for the query survey, computer. This query can be represented by the vector $\mathbf{q}$ whose entries are 1 in the rows corresponding to the terms survey and computer and 0 in all other rows.
In a standard term-matching search with term-document matrix $A$, a query vector $\mathbf{q}$ would be matched with the terms to determine the number of matches. The matching $\mathbf{q}^TA$ counts the number of times each document agrees with the query.
We can use the cosine calculation from part (b) to compare matches to our query: the closer the cosine is to 1, the better the match (dividing by the product of the norms essentially converts all vectors to unit vectors for comparison purposes). This is often referred to as the cosine distance. Calculate the cosines of the angles between the query vector $\mathbf{q}$ and each document column of $A$ for our example of the query survey, computer. Order the documents from best to worst match for this query.
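A sketch of the cosine computation is below; the small matrix and query are stand-ins, and for the activity the full term-document matrix and the query vector for survey, computer would be used instead.

```python
import numpy as np

def cosine_scores(A, q):
    """Cosine of the angle between the query q and each column (document) of A."""
    return (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))

# Stand-in term-document matrix (3 terms x 4 documents) and a query on the second term.
A = np.array([[1.0, 0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 0.0]])
q = np.array([0.0, 1.0, 0.0])

scores = cosine_scores(A, q)
print(scores)
print(np.argsort(scores)[::-1])   # documents ordered from best to worst match
```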
Though we were able to rate the documents in Project Activity 29.5 using the cosine distance, the result is less than satisfying. Documents $c_3$, $c_4$, and $c_5$ are all related to computers but do not appear at all in our results. This is a problem with language searches: we don't want to compare just words; we also need to compare the concepts the words represent. The fact that words can represent different things implies that a random choice of word by different authors can introduce noise into the word-concept relationship. To filter out this noise, we can apply the singular value decomposition to find a smaller set of concepts to better represent the relationships. Before we do so, we examine some useful properties of the term-document matrix.
In Project Activity 29.5 you should have seen that the matching counts are given by dot products with the columns of $A$. Assume for the moment that all of the entries in $A$ are either 0 or 1. Explain why in this case the dot product of two columns of $A$ tells us how many terms documents $i$ and $j$ have in common. Also, the matrix $A^TA$ takes dot products of the columns of $A$, which refer to what's happening in each document, and so is looking at document-document interactions. For these reasons, we call $A^TA$ the document-document matrix.
Use appropriate technology to calculate the entries of the matrix $AA^T$. This matrix is the term-term matrix. Assume for the moment that all of the entries in $A$ are either 0 or 1. Explain why if terms $i$ and $j$ occur together in $k$ documents, then $\left(AA^T\right)_{ij} = k$.
To see why a singular value decomposition might be useful, suppose our term-document matrix $A$ has singular value decomposition $A = U\Sigma V^T$. (Don't actually calculate the SVD yet.)
Show that the document-document matrix $A^TA$ satisfies $A^TA = \left(V\Sigma^T\right)\left(V\Sigma^T\right)^T$. This means that we can compare document $i$ and document $j$ using the dot product of row $i$ and column $j$ of the matrix product $\left(V\Sigma^T\right)\left(V\Sigma^T\right)^T$.
Show that the term-term matrix $AA^T$ satisfies $AA^T = \left(U\Sigma\right)\left(U\Sigma\right)^T$. Thus we can compare term $i$ and term $j$ using the dot product of row $i$ and column $j$ of the matrix product $\left(U\Sigma\right)\left(U\Sigma\right)^T$. (Exercise 6 shows that the columns of $U$ are orthogonal eigenvectors of $AA^T$.)
As we will see, the connection of the matrices $U$ and $V$ to terms and documents that we saw in Project Activity 29.7 will be very useful when we use the SVD of the term-document matrix to reduce dimensions to a “concept” space. We will be able to interpret the rows of the matrices $U$ and $V$ as providing coordinates for terms and documents in this space.
The singular value decomposition (SVD) allows us to produce new, improved term-document matrices. For this activity, use the term-document matrix in Figure 29.11.
Use appropriate technology to find a singular value decomposition of $A$ so that $A = U\Sigma V^T$. Print your entries to two decimal places (but keep as many as possible for computational purposes).
Each singular value tells us how important its semantic dimension is. If we remove the smaller singular values (the less important dimensions), we retain the important information but eliminate minor details and noise. We produce a new term-document matrix $A_k$ by keeping the largest $k$ of the singular values and discarding the rest. This gives us the approximation
$$A_k = \sigma_1\mathbf{u}_1\mathbf{v}_1^T + \sigma_2\mathbf{u}_2\mathbf{v}_2^T + \cdots + \sigma_k\mathbf{u}_k\mathbf{v}_k^T$$
using the outer product decomposition, where $\sigma_1$, $\sigma_2$, $\ldots$, $\sigma_k$ are the $k$ largest singular values of $A$. Note that if $A$ is an $m \times n$ matrix, letting $U_k = [\mathbf{u}_1 \ \mathbf{u}_2 \ \cdots \ \mathbf{u}_k]$ (an $m \times k$ matrix), $\Sigma_k$ the $k \times k$ matrix with the first $k$ singular values along the diagonal, and $V_k = [\mathbf{v}_1 \ \mathbf{v}_2 \ \cdots \ \mathbf{v}_k]$ (an $n \times k$ matrix), then we can also write $A_k = U_k\Sigma_kV_k^T$. This is sometimes referred to as a reduced SVD. Find $U_k$, $\Sigma_k$, and $V_k$, and find the new term-document matrix $A_k$.
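A reduced SVD is easy to form from a full one by slicing; the sketch below does this for a stand-in matrix (for the project, $A$ would be the term-document matrix from Figure 29.11 and $k$ the number of retained concepts).

```python
import numpy as np

def reduced_svd(A, k):
    """Keep only the k largest singular values: returns U_k, Sigma_k, V_k."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], np.diag(s[:k]), Vt[:k, :].T

# Stand-in matrix; replace with the term-document matrix for the activity.
rng = np.random.default_rng(0)
A = rng.integers(0, 3, size=(6, 5)).astype(float)

U_k, Sigma_k, V_k = reduced_svd(A, 2)
A_k = U_k @ Sigma_k @ V_k.T          # new term-document matrix of rank at most 2
print(np.round(A_k, 2))
```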
Once we have our term-document matrix, there are three basic comparisons to make: comparing terms, comparing documents, and comparing terms and documents. Term-document matrices are usually very large, with dimension being the number of terms. By using a reduced SVD we can create a much smaller approximation. In our example, the matrix $A_2$ in Project Activity 29.8 reduces our problem to a 2-dimensional space. Intuitively, we can think of LSI as representing terms as averages of all of the documents in which they appear and documents as averages of all of the terms they contain. Through this process, LSI attempts to combine the surface information in our library into a deeper abstraction (the “concept” space) that captures the mutual relationships between terms and documents.
We now need to understand how we can represent documents and terms in this smaller space where $A_k = U_k\Sigma_kV_k^T$. Informally, we can consider the rows of $U_k$ as representing the coordinates of each term in the lower dimensional concept space and the columns of $V_k^T$ as the coordinates of the documents, while the entries of $\Sigma_k$ tell us how important each semantic dimension is. The dot products of the row vectors of $A_k$ indicate how terms compare across documents. These products make up the matrix $A_kA_k^T$. Just as in Project Activity 29.7, we have $A_kA_k^T = \left(U_k\Sigma_k\right)\left(U_k\Sigma_k\right)^T$. In other words, if we consider the rows of $U_k\Sigma_k$ as coordinates for terms, then the dot products of these rows give us term to term comparisons. (Note that multiplying $U_k$ by $\Sigma_k$ just stretches the rows of $U_k$ by the singular values according to the importance of the concept represented by that singular value.) Similarly, the dot products between columns of $A_k$ provide a comparison of documents. This comparison is given by $A_k^TA_k = \left(V_k\Sigma_k\right)\left(V_k\Sigma_k\right)^T$ (again by Project Activity 29.7). So we can consider the rows of $V_k\Sigma_k$ as providing coordinates for documents.
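The identities above are easy to verify numerically. In the sketch below (with a stand-in matrix), the rows of $U_k\Sigma_k$ act as term coordinates and the rows of $V_k\Sigma_k$ as document coordinates, and their dot products reproduce $A_kA_k^T$ and $A_k^TA_k$.

```python
import numpy as np

# Stand-in term-document matrix and a 2-dimensional concept space.
A = np.array([[1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, Sigma_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T
A_k = U_k @ Sigma_k @ V_k.T

term_coords = U_k @ Sigma_k   # row i: concept-space coordinates of term i
doc_coords  = V_k @ Sigma_k   # row j: concept-space coordinates of document j

# Dot products of these rows give the term-term and document-document comparisons.
print(np.allclose(term_coords @ term_coords.T, A_k @ A_k.T))   # True
print(np.allclose(doc_coords @ doc_coords.T, A_k.T @ A_k))     # True
```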
We have seen how to compare terms to terms and documents to documents. The matrix $A_k$ itself compares terms to documents. Show that $A_k = \left(U_k\Sigma_k^{1/2}\right)\left(V_k\Sigma_k^{1/2}\right)^T$, where $\Sigma_k^{1/2}$ is the diagonal matrix of the same size as $\Sigma_k$ whose diagonal entries are the square roots of the corresponding diagonal entries in $\Sigma_k$. Thus, all useful comparisons of terms and documents can be made using the rows of the matrices $U_k$ and $V_k$, scaled in some way by the singular values in $\Sigma_k$.
To work in this smaller concept space, it is important to be able to find appropriate comparisons to objects that appeared in the original search. For example, to complete the latent structure view of the system, we must also convert the original query to a representation within the new term-document system represented by $A_k$. This new representation is called a pseudo-document.
For an original query $\mathbf{q}$, we start with its term vector (a vector in the coordinate system determined by the rows of $A$) and find a representation $\hat{\mathbf{q}}$ that we can use as a column of $V_k^T$ in the document-document comparison matrix. If this representation were perfect, then for a real document in the original system given by a column $\mathbf{a}_j$ of $A$ it would produce the corresponding column of $V^T$ if we used the full SVD. In other words, from $A = U\Sigma V^T$ we get $V^T = \Sigma^{-1}U^TA$ (when $\Sigma$ is invertible), which suggests defining the pseudo-document as $\hat{\mathbf{q}} = \Sigma_k^{-1}U_k^T\mathbf{q}$.
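Here is a minimal sketch of this pseudo-document computation, using the standard LSI formula $\hat{\mathbf{q}} = \Sigma_k^{-1}U_k^T\mathbf{q}$ with a stand-in matrix and query; one common choice, as discussed above, is then to compare $\hat{\mathbf{q}}$ against the rows of $V_k$ with the cosine distance.

```python
import numpy as np

# Stand-in term-document matrix, its reduced SVD, and a query in term space.
A = np.array([[1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, Sigma_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

q = np.array([0.0, 1.0, 0.0, 1.0])        # query marking the 2nd and 4th terms

# Pseudo-document: the query's coordinates in the k-dimensional concept space.
q_hat = np.linalg.inv(Sigma_k) @ U_k.T @ q

# Cosine distance from the query to each document (rows of V_k) in concept space.
cos = (V_k @ q_hat) / (np.linalg.norm(V_k, axis=1) * np.linalg.norm(q_hat))
print(np.argsort(cos)[::-1])               # documents ranked from best to worst match
```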
In our example, using $k = 2$, the terms can now be represented as 2-dimensional vectors (the rows of $U_2$, see Figure 29.12), or as points in the plane. More specifically, human is represented by the vector (to two decimal places) in the first row of Figure 29.12, interface by the vector in the second row, etc. Similarly, the documents are represented by columns of the matrix in Figure 29.13, so that the document $c_1$ is represented by the first column, $c_2$ by the second, etc. From this perspective we can visualize these documents in the plane. Plot the documents and the query in the 2-dimensional concept space. Then calculate the cosine distances from the query to the documents in this space. Which documents now give the best three matches to the query? Compare the matches to your plot.
As we can see from Project Activity 29.10, the original query had no match at all with any documents except $c_1$, $c_2$, and $m_4$. In the new concept space, the query now has some connection to every document. So LSI has made semantic connections between the terms and documents that were not present in the original term-document matrix, which gives us better results for our search.
Technically this definition should be in terms of a supremum, but because the equivalent definition restricts the $\mathbf{x}$'s to a compact subset, the sup is achieved and we can use max.
e.g., Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 1990, 41: 391–407, and Landauer, T. K. and Dumais, S. T. A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. Psychological Review, 1997, Vol. 104, No. 2, 211–240.