let's move on to an example, similar to the last, but with some more 'noise'
again let's work with a contrived corpus
d1: a a a a a a b b b b b b b b c c c c c c e e f f
d2: a a a a a a a b b b b b b c c c c c c c c c d
d3: a c c c c c c c c d d d d d d d d e e e e e e e e e f f f f f f f
d4: b c c c c c d d d d d d d d e e e e e e e f f f f f f f
which is represented as the 6x4 document term matrix A
A | d1 | d2 | d3 | d4 |
a | 6 | 7 | 1 | 0 |
b | 8 | 6 | 0 | 1 |
c | 6 | 9 | 8 | 5 |
d | 0 | 1 | 8 | 8 |
e | 2 | 0 | 9 | 7 |
f | 2 | 0 | 7 | 7 |
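the post feeds these counts to SVDLIBC; as a minimal sketch (assuming python + numpy, which is not what the post itself uses) the same document term matrix can be counted up directly from the corpus above:

# build the 6x4 term x document count matrix from the contrived corpus
# (numpy here is an assumption; the post itself passes the counts to SVDLIBC)
import numpy as np

terms = ['a', 'b', 'c', 'd', 'e', 'f']
docs = [
    'a a a a a a b b b b b b b b c c c c c c e e f f',                    # d1
    'a a a a a a a b b b b b b c c c c c c c c c d',                      # d2
    'a c c c c c c c c d d d d d d d d e e e e e e e e e f f f f f f f',  # d3
    'b c c c c c d d d d d d d d e e e e e e e f f f f f f f',            # d4
]
A = np.array([[d.split().count(t) for d in docs] for t in terms])
print(A)  # rows a..f, columns d1..d4, matching the table above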
here is a decomposition of A, performed again using SVDLIBC; the four tables below are A, U, S and Vt respectively (so that A = U S Vt)
A | d1 | d2 | d3 | d4 |
a | 6 | 7 | 1 | 0 |
b | 8 | 6 | 0 | 1 |
c | 6 | 9 | 8 | 5 |
d | 0 | 1 | 8 | 8 |
e | 2 | 0 | 9 | 7 |
f | 2 | 0 | 7 | 7 |
U | f1 | f2 | f3 | f4 |
a | 0.24 | -0.51 | 0.08 | 0.06 |
b | 0.25 | -0.54 | -0.64 | -0.23 |
c | 0.58 | -0.28 | 0.57 | 0.13 |
d | 0.42 | 0.37 | 0.16 | -0.68 |
e | 0.44 | 0.34 | -0.24 | 0.66 |
f | 0.39 | 0.29 | -0.40 | -0.09 |
S | f1 | f2 | f3 | f4 |
f1 | 23.1 | 0 | 0 | 0 |
f2 | 0 | 14.3 | 0 | 0 |
f3 | 0 | 0 | 3.5 | 0 |
f4 | 0 | 0 | 0 | 1.5 |
Vt | d1 | d2 | d3 | d4 |
f1 | 0.37 | 0.38 | 0.65 | 0.53 |
f2 | -0.55 | -0.63 | 0.37 | 0.38 |
f3 | -0.69 | 0.59 | 0.27 | -0.21 |
f4 | 0.26 | -0.29 | 0.59 | -0.69 |
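continuing the numpy sketch above (again, an assumption; the post itself used SVDLIBC), the same decomposition falls out of numpy's svd. note the sign of each singular vector pair is arbitrary, so individual columns of U and rows of Vt may come out negated relative to the tables above:

# decompose A = U S Vt with numpy instead of SVDLIBC
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # U: 6x4, s: length 4, Vt: 4x4
S = np.diag(s)
print(s.round(1))                  # roughly [23.1 14.3 3.5 1.5]
print(np.allclose(U @ S @ Vt, A))  # True; the product reconstructs A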
recall:
S describes the relative strengths of the features
U describes the relationship between terms (rows) and features (columns)
Vt describes the relationship between features (rows) and documents (columns)
similarly to our last example we've got two dominant features, but the additional non zero term frequencies mean there is now some variance across all 4 possible features. since the first two singular values (23.1 and 14.3) are much larger than the last two (3.5 and 1.5) we can again conclude that there are two main dominant features in the corpus. (this is representative of the general non contrived case; if we were dealing with a large number of documents we'd only be interested in the first few dominant features)
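continuing the numpy sketch, this is where the truncation would happen; keeping just the two dominant features still gives a decent reconstruction of A:

# rank 2 approximation: keep only the two dominant features
k = 2
A2 = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(A2.round(1))  # close to the original counts despite dropping f3 and f4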
the matrix product VS (where V is the transpose of Vt) describes the relation between documents (VS's rows) and the features (VS's columns)
VS | f1 | f2 | f3 | f4 |
d1 | 8.624 | -7.973 | -2.447 | -0.259 |
d2 | 8.928 | -9.081 | 2.177 | 0.283 |
d3 | 15.116 | 5.402 | 0.627 | -0.956 |
d4 | 13.044 | 5.227 | -0.599 | 1.086 |
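in the numpy sketch this is just V (the transpose of Vt) times S; as before, signs may be flipped relative to the table:

# document coordinates in feature space: one row per document d1..d4
VS = Vt.T @ S
print(VS.round(3))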
the matrix product US describes the relation between terms (US's rows) and the features (US's columns). there is a bit more jitter in the points this time, but a similar analysis to last time holds
US | f1 | f2 | f3 | f4 |
a | 5.502 | -7.449 | 0.349 | -0.352 |
b | 5.768 | -7.943 | -2.100 | 0.477 |
c | 14.091 | -3.864 | 1.869 | -0.095 |
d | 9.962 | 5.337 | 0.709 | 0.880 |
e | 10.404 | 4.867 | -1.016 | -1.019 |
f | 9.118 | 4.107 | -1.386 | 0.259 |
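and the same for terms, where keeping only the first two columns gives the 2d points being eyeballed above:

# term coordinates in feature space: one row per term a..f
US = U @ S
print(US[:, :2].round(3))  # first two (dominant) features only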
ok then, enough of this contrived stuff, let's have a look at an example with real data