problems with low support
in my last post we went through mutual information as a way of finding collocations.
the astute reader may have noticed that for the list of top bigrams i only showed ones that had a frequency above 5,000.
why this cutoff? well, it turns out that one of the criticisms of this definition of mutual information is that it gives wacky results in low-support cases.
if we sort purely by the mutual information score we find that the top 250,000 bigrams all have the same score and correspond to bigrams that occur only once in the corpus (and whose terms appear only in that bigram). some examples include "Bruail Brueil", "LW1X LW4X" and "UG-211 GB-HMF", and they are, as often as not it seems, artifacts of parsing quirks.
so how did i decide on a minimum support of 5,000? it's just a round number near the 99th percentile of the frequency of frequencies. purely a magic number, not good!
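for concreteness, here's a toy sketch of why the singletons saturate the score (a hypothetical `pmi` helper, assuming the usual pointwise mutual information definition with natural logs — adjust if the definition from the last post differs):

```python
import math

def pmi(bigram_freq, freq1, freq2, total_bigrams):
    # pointwise mutual information: log( p(t1,t2) / (p(t1) * p(t2)) )
    return math.log(bigram_freq * total_bigrams / (freq1 * freq2))

N = 1_110_473_107  # total bigram count used later in this post

# a one-off parsing artifact: bigram seen once, each term seen only in it.
# p(t1,t2) = p(t1) = p(t2) = 1/N, so the score is log(N) -- the maximum possible
print(pmi(1, 1, 1, N))

# whereas a genuinely common pair scores far lower on raw pmi
print(pmi(12_290_443, 41_478_115, 74_807_672, N))
```

every count-1, terms-only-in-this-bigram pair gets exactly the same maximal log(N) score, which is why they tie for the top 250,000 slots.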
a more principled alternative is the log likelihood ratio. as a concrete implementation i'll just use the LogLikelihood code provided in mahout. (yay for being able to use an arbitrary static java function as a pig udf!)
to use this method we need four values, k11, k12, k21 and k22, all conveniently calculable from the counts we gathered for mutual info:
| cell | meaning | value |
|---|---|---|
| k11 | t1 & t2 | freq(t1,t2) |
| k12 | t1 & not t2 | freq(t1) - freq(t1,t2) |
| k21 | not t1 & t2 | freq(t2) - freq(t1,t2) |
| k22 | not t1 & not t2 | total_num_bigrams - ( freq(t1) + freq(t2) - freq(t1,t2) ) |
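translated into code, the table is just this (a quick python sketch, not the actual pig script):

```python
def contingency(bigram_freq, freq1, freq2, total_bigrams):
    # build the 2x2 contingency table (k11, k12, k21, k22) for one bigram
    k11 = bigram_freq                     # t1 followed by t2
    k12 = freq1 - bigram_freq             # t1 followed by something else
    k21 = freq2 - bigram_freq             # something other than t1, then t2
    k22 = total_bigrams - (freq1 + freq2 - bigram_freq)  # neither term involved
    return k11, k12, k21, k22

# the (of, the) numbers worked through later in the post
print(contingency(12_290_443, 41_478_115, 74_807_672, 1_110_473_107))
```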
calculating this value for the 1,300,000,000 bigrams of wikipedia (code here) we get these top 10 bigrams...
this surprised me a bit since these bigrams are not at all what i expected, especially when you compare the results against just ranking by raw bigram frequency (which is obviously much easier to calculate):
| rank based on llr | rank based on freq | bigram |
|---|---|---|
do i have a bug?
at first i thought i must have a bug, but manually redoing the top one, (of, the), gives the same answer:
k11 = f(of,the) = 12,290,443
k12 = f(of) - f(of,the) = 41,478,115 - 12,290,443 = 29,187,672
k21 = f(the) - f(of,the) = 74,807,672 - 12,290,443 = 62,517,229
k22 = total number bigrams - ( f(of) + f(the) - f(of, the) ) = 1,110,473,107 - ( 41,478,115 + 74,807,672 - 12,290,443 ) = 1,006,477,763
and org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(12290443, 29187672, 62517229, 1006477763) = 2.1369480467885613E7
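and a from-scratch reimplementation agrees too. this is a python sketch of dunning's llr using the xLogX/entropy formulation that, as far as i can tell, mahout's LogLikelihood is also built on — an independent check, not mahout itself:

```python
import math

def x_log_x(x):
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    # unnormalised "entropy" of a list of counts: n*log(n) - sum(k*log(k))
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    # dunning's log likelihood ratio (G^2) for a 2x2 contingency table
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    return 2.0 * (row_entropy + col_entropy - matrix_entropy)

print(llr(12_290_443, 29_187_672, 62_517_229, 1_006_477_763))  # ~2.1369e7
```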
hmmm.. i wonder what i'm missing? to be continued i guess!
weighting by frequency
coming back to mutual information, a variation is to simply weight by the bigram frequency.
so instead of sorting on mutual info we can sort on log(bigram freq) * mutual info. note: we can't weight using the bigram frequency directly because its values are on a completely different scale to the mutual info. rather than normalising, we can compress the bigram frequency with a logarithm, which feels "fair" since these frequencies follow a power law anyway.
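here's a quick python sketch of the weighted metric (a hypothetical `weighted_pmi` helper, again assuming pointwise mutual information with natural logs):

```python
import math

def weighted_pmi(bigram_freq, freq1, freq2, total_bigrams):
    # damp pmi by log of the bigram frequency so that well-attested
    # bigrams outrank one-off parsing artifacts
    pmi = math.log(bigram_freq * total_bigrams / (freq1 * freq2))
    return math.log(bigram_freq) * pmi

N = 1_110_473_107

# a singleton artifact now scores 0 (log(1) == 0) instead of the maximum
print(weighted_pmi(1, 1, 1, N))

# while a common pair keeps a positive score
print(weighted_pmi(12_290_443, 41_478_115, 74_807_672, N))
```

a nice side effect: the degenerate count-1 bigrams are zeroed out entirely rather than needing an arbitrary frequency cutoff.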
ordering by this new metric gives the following results, which seem ok. the main thing is i didn't have to specify an arbitrary cut-off frequency!
| bigram | bigram freq | mutual info | log(freq) * mutual info |
|---|---|---|---|
at least good enough to provide data for the next experiment...