# brain of mat kelcey...

## collocations in wikipedia, part 2

November 05, 2011 at 05:00 PM | categories: Uncategorized

## problems with low support

in my last post we went through mutual information as a way of finding collocations.

the astute reader may have noticed that for the list of top bigrams i only showed ones that had a frequency above 5,000.

why this cutoff? well it turns out that one of the criticisms of this definition of mutual information is that it gives whacky results for low support cases.

if we purely just sort by the mutual information score we find that the top 250,000 all have the same score and correpond to bigrams that occur only once in the corpus (and whose terms only appear in that bigram). some examples include "Bruail Brueil", "LW1X LW4X" and "UG-211 GB-HMF" and they are, as often as not it seems, artifacts of parsing quirks.

so how did i decide the minimum support of 5,000? it's just a round number near the 99th percentile of the frequency of frequencies. purely a magic number, not good!

## likelihood ratios

another approach that doesn't suffer this problem with low frequency bigrams is to use likelihood ratios. one such test is the g-test very well described in this blog post by ted dunning

as a concrete implementation i'll just use the LogLikelihood code provided in mahout. (yay for being able to use an arbitrary static java function as a pig udf!)

to use this method we need 4 values, k11, k12, k21 and k22, all conveniently calculatable from the counts we gathered for mutual info

 k_value description calculated from k11 t1 & t2 freq(t1,t2) k12 t1 & not t2 freq(t1) - freq(t1,t2) k21 not t1 & t2 freq(t2) - freq(t1,t2) k22 not t1 & not t2 total_num_bigrams - ( freq(t1) + freq(t2) - freq(t1,t2) )

calculating this value for the 1,300,000,000 bigrams of wikipedia (code here) we get these top 10 bigrams...

 rank bigram llr 1 of the 21,369,480 2 in the 12,669,724 3 the the 10,743,814 4 United States 10,490,802 5 is a 8,948,460 6 New York 6,973,104 7 such as 6,175,861 8 the of 5,821,300 9 to be 5,374,484 10 has been 5,348,722

this surprised me a bit since these bigrams are not at all what i expected.... especially when you compare the results against just ranking by raw bigram frequency (which is obviously much easier to calculate)

 rank basedon llr rank basedon freq bigram 1 1 of the 2 2 in the 3 20,000 the the 4 31 United States 5 4 is a 6 48 New York 7 27 such as 8 18,000 the of 9 14 to be 10 36 has been

## do i have a bug?

at first i thought i must have a bug but manually redoing the top one (of,the) gives the same answer

k11 = f(of,the) = 12,290,443

k12 = f(of) - f(of,the) = 41,478,115 - 12,290,443 = 29,187,672

k21 = f(the) - f(of,the) = 74,807,672 - 12,290,443 = 62,517,229

k22 = total number bigrams - ( f(of) + f(the) - f(of, the) ) = 1,110,473,107 - ( 41,478,115 + 74,807,672 - 12,290,443 ) = 1,006,477,763

and org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(12290443, 29187672, 62517229, 1006477763) = 2.1369480467885613E7

hmmm.. i wonder what i'm missing? to be continued i guess!

## weighting by frequency

coming back to mutual information a variation is to simply weight by the bigram frequency.

so instead of sorting on mutual info we can sort on log(bigram freq) * mutual info. note: we can't weight using the bigram frequency directly because the values are on such different scales as the mutual info. rathering just normalising we can reduce the bigram frequency using a logarithm which feels "fair" since these frequencies follow a power law anyways.

ordering by this new metric gives the following results which seem ok. the main thing is i didn't have to specify an arbitrary cut off frequency!

 bigram bigram freq mutual info log(freq) * mutual info fn org 45050 14.699 157.514 Buenos Aires 20682 15.808 157.092 Socorro LINEAR 97365 13.576 155.943 gastropod mollusk 19342 15.687 154.835 Hong Kong 67738 13.827 153.804 Los Angeles 134801 12.883 152.172 Tel Aviv 9144 16.667 152.018 Burkina Faso 5407 17.649 151.705 Kuala Lumpur 6450 17.233 151.173 Notre Dame 13546 15.873 151.021

at least good enough to provide data for the next experiment...