latent semantic analysis via the singular value decomposition

following on from some previous work on classifying documents, i wanted to see how well latent semantic analysis (lsa) does at the same task.

wha?

usually when comparing documents we do so using the fundamental unit of the text: the actual terms themselves.
lsa gives a way of comparing documents at a higher level than the terms by introducing a concept called the feature.
the singular value decomposition (svd) is a way of extracting features from documents.

an example

let's go through a high level example to help build the intuition and see what these features 'look like'

first let's introduce the term occurrence matrix, a common way to describe a corpus, where rows represent terms and columns represent documents.
a value of n for matrix element a_i,j denotes that the i^th term occurred n times in the j^th document.

consider the, extremely contrived, documents...

d1: modem the steering linux. modem, linux the modem. steering the modem. linux!
d2: linux; the linux. the linux modem linux. the modem, clutch the modem. petrol.
d3: petrol! clutch the steering, steering, linux. the steering clutch petrol. clutch the petrol; the clutch.
d4: the the the. clutch clutch clutch! steering petrol; steering petrol petrol; steering petrol!!!!

which are represented as the 6x4 term occurrence matrix below (6 terms as rows, 4 documents as columns)

          d1  d2  d3  d4
linux      3   4   1   0
modem      4   3   0   1
the        3   4   4   3
clutch     0   1   4   3
steering   2   0   3   3
petrol     0   1   3   4
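as a rough sketch, here is one way counts like these could be produced in python. the names tokenize and term_occurrence_matrix are made up for this illustration, and treating any run of letters as a term is an assumed tokenisation, not necessarily how the matrix above was built.

```python
import re
from collections import Counter

# the four contrived documents, copied exactly as printed above
docs = {
    "d1": "modem the steering linux. modem, linux the modem. steering the modem. linux!",
    "d2": "linux; the linux. the linux modem linux. the modem, clutch the modem. petrol.",
    "d3": "petrol! clutch the steering, steering, linux. the steering clutch petrol. clutch the petrol; the clutch.",
    "d4": "the the the. clutch clutch clutch! steering petrol; steering petrol petrol; steering petrol!!!!",
}
terms = ["linux", "modem", "the", "clutch", "steering", "petrol"]


def tokenize(text):
    # assumption: a term is any run of letters; punctuation is discarded
    return re.findall(r"[a-z]+", text.lower())


def term_occurrence_matrix(docs, terms):
    # one Counter per document, then one row per term and one column per document
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    return [[counts[name][term] for name in docs] for term in terms]


matrix = term_occurrence_matrix(docs, terms)
for term, row in zip(terms, matrix):
    print(f"{term:<9}" + " ".join(str(n) for n in row))
```

run against the documents exactly as printed, this agrees with the matrix everywhere except one cell: the d4 text contains no 'modem', so that count comes out as 0 where the matrix lists 1.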

straight away we can see, based on which words the documents contain, that d1 and d2 are alike and d3 and d4 are alike
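one way to put a number on 'alike' is the cosine similarity between document columns. this is a common choice for comparing term vectors rather than anything specific to lsa, and the cosine helper below is just an illustrative name. the vectors are the columns of the matrix as printed.

```python
import math

# columns of the term occurrence matrix above
# (term order: linux, modem, the, clutch, steering, petrol)
docs = {
    "d1": [3, 4, 3, 0, 2, 0],
    "d2": [4, 3, 4, 1, 0, 1],
    "d3": [1, 0, 4, 4, 3, 3],
    "d4": [0, 1, 3, 3, 3, 4],
}


def cosine(a, b):
    # dot product divided by the product of the two vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)


# print the similarity of every pair of documents
names = list(docs)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(a, b, round(cosine(docs[a], docs[b]), 2))
```

d1/d2 and d3/d4 both score roughly 0.9, while every cross pair stays below 0.6. the shared term 'the' pushes all of these scores up, since every document uses it.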

the terms linux and modem are used a lot in the first two documents. one can imagine that they are representative of a concept; we could call it computers
the terms clutch, steering and petrol are used a lot in the last two documents; perhaps they are representative of a concept we could call automotive
the term the is an interesting one; it is used across all the documents and is not really related to either topic, it's more an english construct

lsa will help us extract these features, computers and automotive
it won't though, alas, give us nice human readable names for them :)
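to get a rough feel for what such features look like as numbers, here is a small pure python sketch that pulls the two strongest singular directions out of the matrix. it uses power iteration with deflation on A^T A as a stand-in for a proper svd routine (a numerical library would normally do this job), and every function name here is made up for the illustration.

```python
import math

terms = ["linux", "modem", "the", "clutch", "steering", "petrol"]
# the term occurrence matrix as printed above (rows: terms, columns: d1..d4)
A = [
    [3, 4, 1, 0],
    [4, 3, 0, 1],
    [3, 4, 4, 3],
    [0, 1, 4, 3],
    [2, 0, 3, 3],
    [0, 1, 3, 4],
]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def normalise(v):
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def top_singular_triplets(A, k, iterations=500):
    """return [(sigma, u, v), ...] for the k largest singular values of A."""
    At = [list(col) for col in zip(*A)]
    found = []
    for _ in range(k):
        v = normalise([1.0 + i for i in range(len(At))])  # arbitrary start vector
        for _ in range(iterations):
            w = matvec(At, matvec(A, v))  # one power iteration step on A^T A
            # deflation: strip out the directions that were already found
            for _sigma, _u, prev_v in found:
                proj = sum(a * b for a, b in zip(w, prev_v))
                w = [a - proj * b for a, b in zip(w, prev_v)]
            v = normalise(w)
        Av = matvec(A, v)
        sigma = math.sqrt(sum(x * x for x in Av))
        u = [x / sigma for x in Av]  # term-side weights for this feature
        found.append((sigma, u, v))
    return found


for sigma, u, v in top_singular_triplets(A, 2):
    print(round(sigma, 2), {t: round(x, 2) for t, x in zip(terms, u)})
```

the first direction gives every term a weight of the same sign, so it mostly reflects overall frequency. in the second, linux and modem take one sign while clutch, steering and petrol take the other, and 'the' sits close to zero. which side is positive is arbitrary.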

next let's look at an example of svd

Aug 2009