brain of mat kelcey...

collocations in wikipedia, part 2

November 05, 2011 at 05:00 PM | categories: Uncategorized

problems with low support

in my last post we went through mutual information as a way of finding collocations.

the astute reader may have noticed that for the list of top bigrams i only showed ones that had a frequency above 5,000.

why this cutoff? well it turns out that one of the criticisms of this definition of mutual information is that it gives whacky results for low support cases.

if we purely just sort by the mutual information score we find that the top 250,000 all have the same score and correpond to bigrams that occur only once in the corpus (and whose terms only appear in that bigram). some examples include "Bruail Brueil", "LW1X LW4X" and "UG-211 GB-HMF" and they are, as often as not it seems, artifacts of parsing quirks.

so how did i decide the minimum support of 5,000? it's just a round number near the 99th percentile of the frequency of frequencies. purely a magic number, not good!

likelihood ratios

another approach that doesn't suffer this problem with low frequency bigrams is to use likelihood ratios. one such test is the g-test very well described in this blog post by ted dunning

as a concrete implementation i'll just use the LogLikelihood code provided in mahout. (yay for being able to use an arbitrary static java function as a pig udf!)

to use this method we need 4 values, k11, k12, k21 and k22, all conveniently calculatable from the counts we gathered for mutual info

k_value	description	calculated from
k11	t1 & t2	freq(t1,t2)
k12	t1 & not t2	freq(t1) - freq(t1,t2)
k21	not t1 & t2	freq(t2) - freq(t1,t2)
k22	not t1 & not t2	total_num_bigrams - ( freq(t1) + freq(t2) - freq(t1,t2) )

calculating this value for the 1,300,000,000 bigrams of wikipedia (code here) we get these top 10 bigrams...

rank	bigram	llr
1	of the	21,369,480
2	in the	12,669,724
3	the the	10,743,814
4	United States	10,490,802
5	is a	8,948,460
6	New York	6,973,104
7	such as	6,175,861
8	the of	5,821,300
9	to be	5,374,484
10	has been	5,348,722

this surprised me a bit since these bigrams are not at all what i expected.... especially when you compare the results against just ranking by raw bigram frequency (which is obviously much easier to calculate)

rank based on llr	rank based on freq	bigram
1	1	of the
2	2	in the
3	20,000	the the
4	31	United States
5	4	is a
6	48	New York
7	27	such as
8	18,000	the of
9	14	to be
10	36	has been

do i have a bug?

at first i thought i must have a bug but manually redoing the top one (of,the) gives the same answer

k11 = f(of,the) = 12,290,443

k12 = f(of) - f(of,the) = 41,478,115 - 12,290,443 = 29,187,672

k21 = f(the) - f(of,the) = 74,807,672 - 12,290,443 = 62,517,229

k22 = total number bigrams - ( f(of) + f(the) - f(of, the) ) = 1,110,473,107 - ( 41,478,115 + 74,807,672 - 12,290,443 ) = 1,006,477,763

and org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(12290443, 29187672, 62517229, 1006477763) = 2.1369480467885613E7

hmmm.. i wonder what i'm missing? to be continued i guess!

weighting by frequency

coming back to mutual information a variation is to simply weight by the bigram frequency.

so instead of sorting on mutual info we can sort on log(bigram freq) * mutual info. note: we can't weight using the bigram frequency directly because the values are on such different scales as the mutual info. rathering just normalising we can reduce the bigram frequency using a logarithm which feels "fair" since these frequencies follow a power law anyways.

ordering by this new metric gives the following results which seem ok. the main thing is i didn't have to specify an arbitrary cut off frequency!

bigram	bigram freq	mutual info	log(freq) * mutual info
fn org	45050	14.699	157.514
Buenos Aires	20682	15.808	157.092
Socorro LINEAR	97365	13.576	155.943
gastropod mollusk	19342	15.687	154.835
Hong Kong	67738	13.827	153.804
Los Angeles	134801	12.883	152.172
Tel Aviv	9144	16.667	152.018
Burkina Faso	5407	17.649	151.705
Kuala Lumpur	6450	17.233	151.173
Notre Dame	13546	15.873	151.021

at least good enough to provide data for the next experiment...