
take 3: markov chains

sip_markov

using term frequency was a good start but it relies on all the terms in a phrase being infrequent, which isn't really fair. for example, given the phrase 'stop your jibber and jabber man!' we might like to conclude that 'jibber and jabber' is a reasonable sip even though it includes a common word like 'and'.

what we need is a way to take the sequence of words into account, and markov chains are a great way to do this. i've talked about them before in other experiments (eg this one) so go have a read if you're new to them.

let's ignore term frequencies completely and try using a markov chain to find the sips.

consider the sequence of characters
abaabdabaabdabaabdacdbacd
( the astute reader might notice this is the string 'abaabd'*3 + 'acdbacd' contrived to have 'c' as a rare character and 'a' as a common one )

considering the character to character transitions we can build the markov chain

a -> a = 3/11 (0.27) (ie 'a' appears 11 times and is followed by 'a' 3 of those times)
a -> b = 6/11 (0.55)
a -> c = 2/11 (0.18)
b -> a = 4/7 (0.57)
b -> d = 3/7 (0.43)
c -> d = 2/2 (1.00)
d -> a = 3/4 (0.75)
d -> b = 1/4 (0.25)
here's a pic where edge thickness is proportional to the number of transitions
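
if you want to play along at home, here's a minimal plain python sketch (nothing to do with the hadoop pipeline, just the arithmetic) that builds these transition probabilities from the example string

from collections import defaultdict

s = 'abaabd' * 3 + 'acdbacd'

transition_counts = defaultdict(int)  # (from, to) -> count
from_counts = defaultdict(int)        # from -> number of outgoing transitions

for frm, to in zip(s, s[1:]):
    transition_counts[(frm, to)] += 1
    from_counts[frm] += 1

# normalise each transition count by the total transitions out of its 'from' state
transition_probs = {(frm, to): n / from_counts[frm]
                    for (frm, to), n in transition_counts.items()}

for (frm, to), p in sorted(transition_probs.items()):
    print('%s -> %s = %d/%d (%.2f)' % (frm, to, transition_counts[(frm, to)], from_counts[frm], p))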

we can define sip_markov(abc) = p(a->b) * p(b->c)
eg from the above markov chain

p(dab) = p(d->a) * p(a->b) = 0.750 * 0.545 = 0.409
p(bac) = p(b->a) * p(a->c) = 0.571 * 0.182 = 0.104 
so bac, being less probable, is more of a sip than dab.

calculating sip_markov for all trigram sequences we get...

dab = da * ab = 0.750 * 0.545 = 0.409
bda = bd * da = 0.429 * 0.750 = 0.321 
aba = ab * ba = 0.545 * 0.571 = 0.312 
cdb = cd * db = 1.000 * 0.250 = 0.250 
abd = ab * bd = 0.545 * 0.429 = 0.234 
acd = ac * cd = 0.182 * 1.000 = 0.182 
baa = ba * aa = 0.571 * 0.273 = 0.156 
aab = aa * ab = 0.273 * 0.545 = 0.149 
dba = db * ba = 0.250 * 0.571 = 0.143 
dac = da * ac = 0.750 * 0.182 = 0.136 
bac = ba * ac = 0.571 * 0.182 = 0.104 
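
(continuing the python sketch from above, reusing s and transition_probs, the same table pops out)

# score each trigram as the product of its two transition probabilities;
# less probable = more of a sip
sip_markov = {}
for a, b, c in zip(s, s[1:], s[2:]):
    sip_markov[a + b + c] = transition_probs[(a, b)] * transition_probs[(b, c)]

for trigram, p in sorted(sip_markov.items(), key=lambda kv: -kv[1]):
    print('%s = %.3f' % (trigram, p))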

it's interesting to note that acd and cdb score quite high when they shouldn't really, since both c and d are rarish characters. why is this?

since c is always followed by d, and markov chains are only concerned with the transitions, c -> d contributes a multiplier of 1 even though c is very rare.

so it looks like we need a combination of both the transition probabilities from the markov chain and the term frequency probabilities.

combining sip_mle with sip_markov

how might we define a combined metric sip_combined('abc') in terms of sip_mle('abc') and sip_markov('abc')?

recall sip_mle('abc') = p(a) * p(b) * p(c) and sip_markov('abc') = p(a -> b) * p(b -> c)

i started by using the geometric mean which seemed a natural choice.

sip_combined('abc') = √ ( sip_mle('abc') * sip_markov('abc') )
it gave reasonable results but it turned out to be overwhelmed by runs of three consecutive hapax legomena, which all result in the same sip_combined value (each term has probability 1/n and each transition probability 1, so every such trigram scores identically).

here is the derivation of a variation on the geometric mean that weights the components differently. i use == to denote an operation that preserves ordering, since we only really care about finding the least probable without caring about the exact values

sip_combined('abc') 
 = √ ( sip_mle('abc') * sip_markov('abc') )
== sip_mle('abc') * sip_markov('abc') 
 = p(a) * p(b) * p(c) * p(a -> b) * p(b -> c)
 = e ^ ( ln(p(a)) + ln(p(b)) + ln(p(c)) + ln(p(a -> b)) + ln(p(b -> c)) )
== ln(p(a)) + ln(p(b)) + ln(p(c)) + ln(p(a -> b)) + ln(p(b -> c))  
 = x + y + z + u + v   # save some writing
== mean(x, y, z, u, v)
now instead of treating the components (x,y,z,u,v) equally we can take more of a weighted sum approach by grouping the mle components (x,y,z) and the markov components (u,v) separately first, giving a new definition for sip_combined...
sip_combined('abc') 
 = mean( mean(x, y, z), mean(u, v) )
== mean(x, y, z) + mean(u, v) 
though it's not obvious from small scale testing whether this gives a fundamentally better result, it doesn't seem any worse and it doesn't suffer from the previously mentioned problem of hapax legomena banding the results.

an example, please....

term probabilities are

a 11/25 == ln(11/25) = -0.82
b 7/25  == ln(7/25)  = -1.27
d 5/25  == ln(5/25)  = -1.60
c 2/25  == ln(2/25)  = -2.52

and recall markov chain probabilities are

a -> a = 3/11 == ln(3/11) = -1.29
a -> b = 6/11 == ln(6/11) = -0.60
a -> c = 2/11 == ln(2/11) = -1.70
b -> a = 4/7  == ln(4/7)  = -0.55
b -> d = 3/7  == ln(3/7)  = -0.84
c -> d = 2/2  == ln(2/2)  =  0.00
d -> a = 3/4  == ln(3/4)  = -0.28
d -> b = 1/4  == ln(1/4)  = -1.38

and combining sip_mle and sip_markov using the weighted geometric mean gives...
bac = (-1.273 -0.821 -2.526)/3 + (-0.560 -1.705)/2 = -4.620/3 -2.264/2 = -2.672
dac = (-1.609 -0.821 -2.526)/3 + (-0.288 -1.705)/2 = -4.956/3 -1.992/2 = -2.648
acd = (-0.821 -2.526 -1.609)/3 + (-1.705 -0.000)/2 = -4.956/3 -1.705/2 = -2.504
dba = (-1.609 -1.273 -0.821)/3 + (-1.386 -0.560)/2 = -3.703/3 -1.946/2 = -2.207
abd = (-0.821 -1.273 -1.609)/3 + (-0.606 -0.847)/2 = -3.703/3 -1.453/2 = -1.961
aab = (-0.821 -0.821 -1.273)/3 + (-1.299 -0.606)/2 = -2.915/3 -1.905/2 = -1.924
baa = (-1.273 -0.821 -0.821)/3 + (-0.560 -1.299)/2 = -2.915/3 -1.859/2 = -1.901
bda = (-1.273 -1.609 -0.821)/3 + (-0.847 -0.288)/2 = -3.703/3 -1.135/2 = -1.802
dab = (-1.609 -0.821 -1.273)/3 + (-0.288 -0.606)/2 = -3.703/3 -0.894/2 = -1.681
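
here's the same calculation as a plain python sketch (again reusing s and transition_probs from before); note it scores every trigram in the string, including aba and cdb which were skipped in the hand calculations above

from collections import Counter
from math import log

term_counts = Counter(s)
n = len(s)

# log probabilities for the term frequency (mle) and markov components
term_lp = {t: log(c / n) for t, c in term_counts.items()}
trans_lp = {k: log(p) for k, p in transition_probs.items()}

sip_combined = {}
for a, b, c in zip(s, s[1:], s[2:]):
    mle_part = (term_lp[a] + term_lp[b] + term_lp[c]) / 3
    markov_part = (trans_lp[(a, b)] + trans_lp[(b, c)]) / 2
    sip_combined[a + b + c] = mle_part + markov_part

# most negative = least probable = most of a sip
for trigram, score in sorted(sip_combined.items(), key=lambda kv: kv[1]):
    print('%s = %.3f' % (trigram, score))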

building the markov chain

finally, how will we generate the markov chain using map reduce semantics? it's actually very similar to the task required for building the term frequencies.

consider abacabc
we want a markov chain

a -> b = 2/3
a -> c = 1/3
b -> a = 1/2
b -> c = 1/2
c -> a = 1/1

first off we need the frequencies of the bigrams, and a version of them keyed by their first term. these values make up the numerators of the transition probabilities in the chain

bash> rake bigrams
bash> rake cat dir=bigrams
ab  2
ac  1
ba  1
bc  1
ca  1
bash> rake bigram_keyed_by_first_elem
bash> rake cat dir=bigram_keyed_by_first_elem
a.1f  ab 2
a.1f  ac 1
b.1f  ba 1
b.1f  bc 1
c.1f  ca 1

then we need the frequencies of the 'from' states. these values make up the denominators of the transition probabilities in the chain

bash> rake bigram_first_elem
bash> rake cat dir=bigram_first_elem
a.0p 3
b.0p 2
c.0p 1

we can then join these last two

bash> rake markov_chain
bash> rake cat dir=markov_chain
ab.0p  0.66
ac.0p  0.33
ba.0p  0.5
bc.0p  0.5
ca.0p  1
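
the rake tasks wrap the actual map reduce jobs, but ignoring hadoop for a second the semantics of the three passes boil down to something like this plain python sketch

from collections import defaultdict

text = 'abacabc'

# pass 1: bigram frequencies (map emits (bigram, 1); reduce sums) - the numerators
bigram_counts = defaultdict(int)
for frm, to in zip(text, text[1:]):
    bigram_counts[frm + to] += 1

# pass 2: frequencies of the 'from' states - the denominators
from_counts = defaultdict(int)
for bigram, count in bigram_counts.items():
    from_counts[bigram[0]] += count

# pass 3: join numerators to denominators on the first element of each bigram
for bigram in sorted(bigram_counts):
    print('%s %0.2f' % (bigram, bigram_counts[bigram] / from_counts[bigram[0]]))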

once we have our markov chain model we can join it to the trigrams in a similar fashion to how we joined the term frequencies with the trigrams on the last page

here's the new set of map reduce passes required...

the bigger example

how does the inclusion of sip_markov change the sips on our 8 document corpus?
doc      sip1                 sip2                  sip3
fbncc10  and numbers edited   university of quebec  structure in 2000
8sknn10  think a scrimption   by she noiselessly    he been disgusted
gm77v10  he the venom         the not colloquial    the not unamiable
7hmvg10  for on dormant       and english morello   to deep carmine
dwqoh10  thinks and nobobody  when one watches      the indifferent routinists
7stir10  or la nouvelle       to poor ellen         poor of marlowe
esper10  in out owa           old maljunec          in day aye
8prmt10  de by belmekhira     in die ferse          sie in feierlichem

sort of interesting... there are some further improvements that could be made:

  1. take a tfidf approach, in that you should favour combos that appear a lot in a particular document
  2. allow bigrams, not just trigrams, which with the previous approach would pluck out bits of terminology
but for now we've got something workable, and for me this has turned more into a learning exercise on hadoop.

but does it scale?


sept 2009