
take 3: markov chains

sip_markov

using term frequency was a good start but it relies on all the terms in a phrase being infrequent, which isn't really fair. for example, given the phrase 'stop your jibber and jabber man!' we might like to conclude that 'jibber and jabber' is a reasonable sip even though it includes a common word like 'and'.

what we need is a way to take the sequence of words into account, and markov chains are a great way to do this. i've talked about them before in other experiments (eg this one) so go have a read if you're new to them.

let's ignore term frequencies completely and try using a markov chain to find the sips.

consider the sequence of characters
abaabdabaabdabaabdacdbacd
( the astute reader might notice this is the string 'abaabd'*3 + 'acdbacd' contrived to have 'c' as a rare character and 'a' as a common one )

considering the character to character transitions we can build the markov chain

a -> a = 3/11 (0.27) (ie 'a' appears 11 times and is followed by 'a' 3 of those times)
a -> b = 6/11 (0.55)
a -> c = 2/11 (0.18)
b -> a = 4/7 (0.57)
b -> d = 3/7 (0.43)
c -> d = 2/2 (1.00)
d -> a = 3/4 (0.75)
d -> b = 1/4 (0.25)
here's a pic where edge thickness is proportional to the number of transitions
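
if you want to play along at home, here's a minimal plain python sketch (nothing to do with the hadoop pipeline, just the arithmetic) that builds these transition probabilities from the example string

from collections import defaultdict

s = 'abaabd' * 3 + 'acdbacd'

transition_counts = defaultdict(int)  # (from, to) -> count
from_counts = defaultdict(int)        # from -> number of outgoing transitions

for frm, to in zip(s, s[1:]):
    transition_counts[(frm, to)] += 1
    from_counts[frm] += 1

# normalise each transition count by the total transitions out of its 'from' state
transition_probs = {(frm, to): n / from_counts[frm]
                    for (frm, to), n in transition_counts.items()}

for (frm, to), p in sorted(transition_probs.items()):
    print('%s -> %s = %d/%d (%.2f)' % (frm, to, transition_counts[(frm, to)], from_counts[frm], p))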

we can define sip_markov(abc) = p(a->b) * p(b->c)
eg from the above markov chain

p(dab) = p(d->a) * p(a->b) = 0.750 * 0.545 = 0.409
p(bac) = p(b->a) * p(a->c) = 0.571 * 0.182 = 0.104 
so bac, being less probable, is more of a sip than dab.

calculating sip_markov for all trigram sequences we get...

dab = da * ab = 0.750 * 0.545 = 0.409
bda = bd * da = 0.429 * 0.750 = 0.321 
aba = ab * ba = 0.545 * 0.571 = 0.312 
cdb = cd * db = 1.000 * 0.250 = 0.250 
abd = ab * bd = 0.545 * 0.429 = 0.234 
acd = ac * cd = 0.182 * 1.000 = 0.182 
baa = ba * aa = 0.571 * 0.273 = 0.156 
aab = aa * ab = 0.273 * 0.545 = 0.149 
dba = db * ba = 0.250 * 0.571 = 0.143 
dac = da * ac = 0.750 * 0.182 = 0.136 
bac = ba * ac = 0.571 * 0.182 = 0.104 
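
(continuing the python sketch from above, reusing s and transition_probs, the same table pops out)

# score each trigram as the product of its two transition probabilities;
# less probable = more of a sip
sip_markov = {}
for a, b, c in zip(s, s[1:], s[2:]):
    sip_markov[a + b + c] = transition_probs[(a, b)] * transition_probs[(b, c)]

for trigram, p in sorted(sip_markov.items(), key=lambda kv: -kv[1]):
    print('%s = %.3f' % (trigram, p))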

it's interesting to note that acd and cdb score quite high when they shouldn't really, since both c and d are rarish characters. why is this?

since c is always followed by d, and markov chains are only concerned with the transitions, c -> d contributes a multiplier of 1 even though c is very rare.

so it looks like we need a combination of both the transition probabilities from the markov chain and the term frequency probabilities.

combining sip_mle with sip_markov

how might we define a combined metric sip_combined('abc') in terms of sip_mle('abc') and sip_markov('abc')?

recall sip_mle('abc') = p(a) * p(b) * p(c) and sip_markov('abc') = p(a -> b) * p(b -> c)

i started by using the geometric mean which seemed a natural choice.

sip_combined('abc') = √ ( sip_mle('abc') * sip_markov('abc') )
it gave reasonable results but it turned out to be overwhelmed by runs of three consecutive hapax legomena, which all result in the same sip_combined value (each term has probability 1/n and each transition probability 1, so every such trigram scores identically).

here is the derivation of a variation on the geometric mean that weights the components differently. i use == to denote an operation that preserves ordering, since we only really care about finding the least probable without caring about the exact values

sip_combined('abc') 
 = √ ( sip_mle('abc') * sip_markov('abc') )
== sip_mle('abc') * sip_markov('abc') 
 = p(a) * p(b) * p(c) * p(a -> b) * p(b -> c)
 = e ^ ( ln(p(a)) + ln(p(b)) + ln(p(c)) + ln(p(a -> b)) + ln(p(b -> c)) )
== ln(p(a)) + ln(p(b)) + ln(p(c)) + ln(p(a -> b)) + ln(p(b -> c))  
 = x + y + z + u + v   # save some writing
== mean(x, y, z, u, v)
now instead of treating the components (x,y,z,u,v) equally we can take more of a weighted sum approach by grouping the mle components (x,y,z) and the markov components (u,v) separately first, giving a new definition for sip_combined...
sip_combined('abc') 
 = mean( mean(x, y, z), mean(u, v) )
== mean(x, y, z) + mean(u, v) 
though it's not obvious from small scale testing whether this gives a fundamentally better result, it doesn't seem any worse and it doesn't suffer from the previously mentioned problem of hapax legomena banding the results.

an example, please....

term probabilities are

a 11/25 == ln(11/25) = -0.82
b 7/25  == ln(7/25)  = -1.27
d 5/25  == ln(5/25)  = -1.60
c 2/25  == ln(2/25)  = -2.52

and recall markov chain probabilities are

a -> a = 3/11 == ln(3/11) = -1.29
a -> b = 6/11 == ln(6/11) = -0.60
a -> c = 2/11 == ln(2/11) = -1.70
b -> a = 4/7  == ln(4/7)  = -0.55
b -> d = 3/7  == ln(3/7)  = -0.84
c -> d = 2/2  == ln(2/2)  =  0.00
d -> a = 3/4  == ln(3/4)  = -0.28
d -> b = 1/4  == ln(1/4)  = -1.38

and combining sip_mle and sip_markov using the weighted geometric mean gives...
bac = (-1.273 -0.821 -2.526)/3 + (-0.560 -1.705)/2 = -4.620/3 -2.264/2 = -2.672
dac = (-1.609 -0.821 -2.526)/3 + (-0.288 -1.705)/2 = -4.956/3 -1.992/2 = -2.648
acd = (-0.821 -2.526 -1.609)/3 + (-1.705 -0.000)/2 = -4.956/3 -1.705/2 = -2.504
dba = (-1.609 -1.273 -0.821)/3 + (-1.386 -0.560)/2 = -3.703/3 -1.946/2 = -2.207
abd = (-0.821 -1.273 -1.609)/3 + (-0.606 -0.847)/2 = -3.703/3 -1.453/2 = -1.961
aab = (-0.821 -0.821 -1.273)/3 + (-1.299 -0.606)/2 = -2.915/3 -1.905/2 = -1.924
baa = (-1.273 -0.821 -0.821)/3 + (-0.560 -1.299)/2 = -2.915/3 -1.859/2 = -1.901
bda = (-1.273 -1.609 -0.821)/3 + (-0.847 -0.288)/2 = -3.703/3 -1.135/2 = -1.802
dab = (-1.609 -0.821 -1.273)/3 + (-0.288 -0.606)/2 = -3.703/3 -0.894/2 = -1.681
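
here's the same calculation as a plain python sketch (again reusing s and transition_probs from before); note it scores every trigram in the string, including aba and cdb which were skipped in the hand calculations above

from collections import Counter
from math import log

term_counts = Counter(s)
n = len(s)

# log probabilities for the term frequency (mle) and markov components
term_lp = {t: log(c / n) for t, c in term_counts.items()}
trans_lp = {k: log(p) for k, p in transition_probs.items()}

sip_combined = {}
for a, b, c in zip(s, s[1:], s[2:]):
    mle_part = (term_lp[a] + term_lp[b] + term_lp[c]) / 3
    markov_part = (trans_lp[(a, b)] + trans_lp[(b, c)]) / 2
    sip_combined[a + b + c] = mle_part + markov_part

# most negative = least probable = most of a sip
for trigram, score in sorted(sip_combined.items(), key=lambda kv: kv[1]):
    print('%s = %.3f' % (trigram, score))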

building the markov chain

finally, how will we generate the markov chain using map reduce semantics? it's actually very similar to the task required for building the term frequencies.

consider abacabc
we want a markov chain

a -> b = 2/3
a -> c = 1/3
b -> a = 1/2
b -> c = 1/2
c -> a = 1/1

first off we need the frequencies of the bigrams, and a version of them keyed by their first term. these values make up the numerators of the transition probabilities in the chain

bash> rake bigrams
bash> rake cat dir=bigrams
ab  2
ac  1
ba  1
bc  1
ca  1
bash> rake bigram_keyed_by_first_elem
bash> rake cat dir=bigram_keyed_by_first_elem
a.1f  ab 2
a.1f  ac 1
b.1f  ba 1
b.1f  bc 1
c.1f  ca 1

then we need the frequencies of the 'from' states. these values make up the denominators of the transition probabilities in the chain

bash> rake bigram_first_elem
bash> rake cat dir=bigram_first_elem
a.0p 3
b.0p 2
c.0p 1

we can then join these last two

bash> rake markov_chain
bash> rake cat dir=markov_chain
ab.0p  0.66
ac.0p  0.33
ba.0p  0.5
bc.0p  0.5
ca.0p  1
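
the rake tasks wrap the actual map reduce jobs, but ignoring hadoop for a second the semantics of the three passes boil down to something like this plain python sketch

from collections import defaultdict

text = 'abacabc'

# pass 1: bigram frequencies (map emits (bigram, 1); reduce sums) - the numerators
bigram_counts = defaultdict(int)
for frm, to in zip(text, text[1:]):
    bigram_counts[frm + to] += 1

# pass 2: frequencies of the 'from' states - the denominators
from_counts = defaultdict(int)
for bigram, count in bigram_counts.items():
    from_counts[bigram[0]] += count

# pass 3: join numerators to denominators on the first element of each bigram
for bigram in sorted(bigram_counts):
    print('%s %0.2f' % (bigram, bigram_counts[bigram] / from_counts[bigram[0]]))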

once we have our markov chain model we can join it to the trigrams in a similar fashion to how we joined the term frequencies with the trigrams on the last page

here's the new set of map reduce passes required...

the bigger example

how does the inclusion of sip_markov change the sips on our 8 document corpus?
doc      sip1                 sip2                  sip3
fbncc10  and numbers edited   university of quebec  structure in 2000
8sknn10  think a scrimption   by she noiselessly    he been disgusted
gm77v10  he the venom         the not colloquial    the not unamiable
7hmvg10  for on dormant       and english morello   to deep carmine
dwqoh10  thinks and nobobody  when one watches      the indifferent routinists
7stir10  or la nouvelle       to poor ellen         poor of marlowe
esper10  in out owa           old maljunec          in day aye
8prmt10  de by belmekhira     in die ferse          sie in feierlichem

sort of interesting... there are some further improvements that could be made:

  1. take a tfidf approach, in that you should favour combos that appear a lot in a particular document
  2. allow bigrams, not just trigrams, which with the previous approach would pluck out bits of terminology
but for now we've got something workable, and for me this has turned more into a learning exercise on hadoop.

but does it scale?


sept 2009