let's move on to an example, similar to the last, but with some more 'noise'
again let's work with a contrived corpus
d1: a a a a a a b b b b b b b b c c c c c c e e f f
d2: a a a a a a a b b b b b b c c c c c c c c c d
d3: a c c c c c c c c d d d d d d d d e e e e e e e e e f f f f f f f
d4: b c c c c c d d d d d d d d e e e e e e e f f f f f f f
which is represented as the 6x4 document term matrix A
A | d1 | d2 | d3 | d4 |
a | 6 | 7 | 1 | 0 |
b | 8 | 6 | 0 | 1 |
c | 6 | 9 | 8 | 5 |
d | 0 | 1 | 8 | 8 |
e | 2 | 0 | 9 | 7 |
f | 2 | 0 | 7 | 7 |
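the post feeds these counts to SVDLIBC; as a minimal sketch (assuming python + numpy, which is not what the post itself uses) the same document term matrix can be counted up directly from the corpus above:

# build the 6x4 term x document count matrix from the contrived corpus
# (numpy here is an assumption; the post itself passes the counts to SVDLIBC)
import numpy as np

terms = ['a', 'b', 'c', 'd', 'e', 'f']
docs = [
    'a a a a a a b b b b b b b b c c c c c c e e f f',                    # d1
    'a a a a a a a b b b b b b c c c c c c c c c d',                      # d2
    'a c c c c c c c c d d d d d d d d e e e e e e e e e f f f f f f f',  # d3
    'b c c c c c d d d d d d d d e e e e e e e f f f f f f f',            # d4
]
A = np.array([[d.split().count(t) for d in docs] for t in terms])
print(A)  # rows a..f, columns d1..d4, matching the table above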
here is a decomposition of A, performed again using SVDLIBC; the four tables below are A, U, S and Vt respectively (so that A = U S Vt)
A | d1 | d2 | d3 | d4 |
a | 6 | 7 | 1 | 0 |
b | 8 | 6 | 0 | 1 |
c | 6 | 9 | 8 | 5 |
d | 0 | 1 | 8 | 8 |
e | 2 | 0 | 9 | 7 |
f | 2 | 0 | 7 | 7 |
U | f1 | f2 | f3 | f4 |
a | 0.24 | -0.51 | 0.08 | 0.06 |
b | 0.25 | -0.54 | -0.64 | -0.23 |
c | 0.58 | -0.28 | 0.57 | 0.13 |
d | 0.42 | 0.37 | 0.16 | -0.68 |
e | 0.44 | 0.34 | -0.24 | 0.66 |
f | 0.39 | 0.29 | -0.40 | -0.09 |
S | f1 | f2 | f3 | f4 |
f1 | 23.1 | 0 | 0 | 0 |
f2 | 0 | 14.3 | 0 | 0 |
f3 | 0 | 0 | 3.5 | 0 |
f4 | 0 | 0 | 0 | 1.5 |
Vt | d1 | d2 | d3 | d4 |
f1 | 0.37 | 0.38 | 0.65 | 0.53 |
f2 | -0.55 | -0.63 | 0.37 | 0.38 |
f3 | -0.69 | 0.59 | 0.27 | -0.21 |
f4 | 0.26 | -0.29 | 0.59 | -0.69 |
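continuing the numpy sketch above (again, an assumption; the post itself used SVDLIBC), the same decomposition falls out of numpy's svd. note the sign of each singular vector pair is arbitrary, so individual columns of U and rows of Vt may come out negated relative to the tables above:

# decompose A = U S Vt with numpy instead of SVDLIBC
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # U: 6x4, s: length 4, Vt: 4x4
S = np.diag(s)
print(s.round(1))                  # roughly [23.1 14.3 3.5 1.5]
print(np.allclose(U @ S @ Vt, A))  # True; the product reconstructs A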
recall:
S describes the relative strengths of the features
U describes the relationship between terms (rows) and features (columns)
Vt describes the relationship between features (rows) and documents (columns)
similarly to our last example we've got two dominant features, but the additional non zero term frequencies mean there is now some variance across all 4 possible features. since the first two singular values (23.1 and 14.3) are much larger than the last two (3.5 and 1.5) we can again conclude that there are two main dominant features in the corpus. (this is representative of the general non contrived case; if we were dealing with a large number of documents we'd only be interested in the first few dominant features)
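continuing the numpy sketch, this is where the truncation would happen; keeping just the two dominant features still gives a decent reconstruction of A:

# rank 2 approximation: keep only the two dominant features
k = 2
A2 = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(A2.round(1))  # close to the original counts despite dropping f3 and f4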
the matrix product VS (where V is the transpose of Vt) describes the relation between documents (VS's rows) and the features (VS's columns)
VS | f1 | f2 | f3 | f4 |
d1 | 8.624 | -7.973 | -2.447 | -0.259 |
d2 | 8.928 | -9.081 | 2.177 | 0.283 |
d3 | 15.116 | 5.402 | 0.627 | -0.956 |
d4 | 13.044 | 5.227 | -0.599 | 1.086 |
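in the numpy sketch this is just V (the transpose of Vt) times S; as before, signs may be flipped relative to the table:

# document coordinates in feature space: one row per document d1..d4
VS = Vt.T @ S
print(VS.round(3))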
the matrix product US describes the relation between terms (US's rows) and the features (US's columns). there is a bit more jitter in the points this time, but a similar analysis to last time holds
US | f1 | f2 | f3 | f4 |
a | 5.502 | -7.449 | 0.349 | -0.352 |
b | 5.768 | -7.943 | -2.100 | 0.477 |
c | 14.091 | -3.864 | 1.869 | -0.095 |
d | 9.962 | 5.337 | 0.709 | 0.880 |
e | 10.404 | 4.867 | -1.016 | -1.019 |
f | 9.118 | 4.107 | -1.386 | 0.259 |
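and the same for terms, where keeping only the first two columns gives the 2d points being eyeballed above:

# term coordinates in feature space: one row per term a..f
US = U @ S
print(US[:, :2].round(3))  # first two (dominant) features only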
ok then, enough of this contrived stuff, let's have a look at an example with real data