brain of mat kelcey

collocations in wikipedia, part 2

November 05, 2011

problems with low support

in my last post we went through mutual information as a way of finding collocations.

the astute reader may have noticed that for the list of top bigrams i only showed ones that had a frequency above 5,000.

why this cutoff? well it turns out that one of the criticisms of this definition of mutual information is that it gives whacky results for low support cases.

if we purely just sort by the mutual information score we find that the top 250,000 all have the same score and correpond to bigrams that occur only once in the corpus (and whose terms only appear in that bigram). some examples include "Bruail Brueil", "LW1X LW4X" and "UG-211 GB-HMF" and they are, as often as not it seems, artifacts of parsing quirks.

so how did i decide the minimum support of 5,000? it's just a round number near the 99th percentile of the frequency of frequencies. purely a magic number, not good!

likelihood ratios

another approach that doesn't suffer this problem with low frequency bigrams is to use likelihood ratios. one such test is the g-test very well described in this blog post by ted dunning

as a concrete implementation i'll just use the LogLikelihood code provided in mahout. (yay for being able to use an arbitrary static java function as a pig udf!)

to use this method we need 4 values, k11, k12, k21 and k22, all conveniently calculatable from the counts we gathered for mutual info

k_value description calculated from
k11 t1 & t2 freq(t1,t2)
k12 t1 & not t2 freq(t1) - freq(t1,t2)
k21 not t1 & t2 freq(t2) - freq(t1,t2)
k22 not t1 & not t2 total_num_bigrams - ( freq(t1) + freq(t2) - freq(t1,t2) )

calculating this value for the 1,300,000,000 bigrams of wikipedia (code here) we get these top 10 bigrams...

1of the21,369,480
2in the 12,669,724
3the the 10,743,814
4United States 10,490,802
5is a 8,948,460
6New York 6,973,104
7such as 6,175,861
8the of 5,821,300
9to be 5,374,484
10has been 5,348,722

this surprised me a bit since these bigrams are not at all what i expected.... especially when you compare the results against just ranking by raw bigram frequency (which is obviously much easier to calculate)

rank based
on llr
rank based
on freq
11of the
22in the
320,000the the
431United States
54is a
648New York
727such as
818,000the of
914to be
1036has been

do i have a bug?

at first i thought i must have a bug but manually redoing the top one (of,the) gives the same answer

k11 = f(of,the) = 12,290,443

k12 = f(of) - f(of,the) = 41,478,115 - 12,290,443 = 29,187,672

k21 = f(the) - f(of,the) = 74,807,672 - 12,290,443 = 62,517,229

k22 = total number bigrams - ( f(of) + f(the) - f(of, the) ) = 1,110,473,107 - ( 41,478,115 + 74,807,672 - 12,290,443 ) = 1,006,477,763

and org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(12290443, 29187672, 62517229, 1006477763) = 2.1369480467885613E7

hmmm.. i wonder what i'm missing? to be continued i guess!

weighting by frequency

coming back to mutual information a variation is to simply weight by the bigram frequency.

so instead of sorting on mutual info we can sort on log(bigram freq) * mutual info. note: we can't weight using the bigram frequency directly because the values are on such different scales as the mutual info. rathering just normalising we can reduce the bigram frequency using a logarithm which feels "fair" since these frequencies follow a power law anyways.

ordering by this new metric gives the following results which seem ok. the main thing is i didn't have to specify an arbitrary cut off frequency!

bigram bigram freq mutual info log(freq) *
mutual info
fn org 45050 14.699 157.514
Buenos Aires 20682 15.808 157.092
Socorro LINEAR 97365 13.576 155.943
gastropod mollusk 19342 15.687 154.835
Hong Kong 67738 13.827 153.804
Los Angeles 134801 12.883 152.172
Tel Aviv 9144 16.667 152.018
Burkina Faso 5407 17.649 151.705
Kuala Lumpur 6450 17.233 151.173
Notre Dame 13546 15.873 151.021

at least good enough to provide data for the next experiment...