brain of mat kelcey...
collocations in wikipedia, part 2
November 05, 2011 at 05:00 PM | categories: Uncategorizedproblems with low support
in my last post we went through mutual information as a way of finding collocations.
the astute reader may have noticed that for the list of top bigrams i only showed ones that had a frequency above 5,000.
why this cutoff? well it turns out that one of the criticisms of this definition of mutual information is that it gives whacky results for low support cases.
if we purely just sort by the mutual information score we find that the top 250,000 all have the same score and correpond to bigrams that occur only once in the corpus (and whose terms only appear in that bigram). some examples include "Bruail Brueil", "LW1X LW4X" and "UG-211 GB-HMF" and they are, as often as not it seems, artifacts of parsing quirks.
so how did i decide the minimum support of 5,000? it's just a round number near the 99th percentile of the frequency of frequencies. purely a magic number, not good!
likelihood ratios
another approach that doesn't suffer this problem with low frequency bigrams is to use likelihood ratios. one such test is the g-test very well described in this blog post by ted dunning
as a concrete implementation i'll just use the LogLikelihood code provided in mahout. (yay for being able to use an arbitrary static java function as a pig udf!)
to use this method we need 4 values, k11, k12, k21 and k22, all conveniently calculatable from the counts we gathered for mutual info
k_value | description | calculated from |
k11 | t1 & t2 | freq(t1,t2) |
k12 | t1 & not t2 | freq(t1) - freq(t1,t2) |
k21 | not t1 & t2 | freq(t2) - freq(t1,t2) |
k22 | not t1 & not t2 | total_num_bigrams - ( freq(t1) + freq(t2) - freq(t1,t2) ) |
calculating this value for the 1,300,000,000 bigrams of wikipedia (code here) we get these top 10 bigrams...
rank | bigram | llr |
1 | of the | 21,369,480 |
2 | in the | 12,669,724 |
3 | the the | 10,743,814 |
4 | United States | 10,490,802 |
5 | is a | 8,948,460 |
6 | New York | 6,973,104 |
7 | such as | 6,175,861 |
8 | the of | 5,821,300 |
9 | to be | 5,374,484 |
10 | has been | 5,348,722 |
this surprised me a bit since these bigrams are not at all what i expected.... especially when you compare the results against just ranking by raw bigram frequency (which is obviously much easier to calculate)
rank basedon llr | rank basedon freq | bigram |
1 | 1 | of the |
2 | 2 | in the |
3 | 20,000 | the the |
4 | 31 | United States |
5 | 4 | is a |
6 | 48 | New York |
7 | 27 | such as |
8 | 18,000 | the of |
9 | 14 | to be |
10 | 36 | has been |
do i have a bug?
at first i thought i must have a bug but manually redoing the top one (of,the) gives the same answer
k11 = f(of,the) = 12,290,443
k12 = f(of) - f(of,the) = 41,478,115 - 12,290,443 = 29,187,672
k21 = f(the) - f(of,the) = 74,807,672 - 12,290,443 = 62,517,229
k22 = total number bigrams - ( f(of) + f(the) - f(of, the) ) = 1,110,473,107 - ( 41,478,115 + 74,807,672 - 12,290,443 ) = 1,006,477,763
and org.apache.mahout.math.stats.LogLikelihood.logLikelihoodRatio(12290443, 29187672, 62517229, 1006477763) = 2.1369480467885613E7
hmmm.. i wonder what i'm missing? to be continued i guess!
weighting by frequency
coming back to mutual information a variation is to simply weight by the bigram frequency.
so instead of sorting on mutual info we can sort on log(bigram freq) * mutual info. note: we can't weight using the bigram frequency directly because the values are on such different scales as the mutual info. rathering just normalising we can reduce the bigram frequency using a logarithm which feels "fair" since these frequencies follow a power law anyways.
ordering by this new metric gives the following results which seem ok. the main thing is i didn't have to specify an arbitrary cut off frequency!
bigram | bigram freq | mutual info | log(freq) * mutual info |
fn org | 45050 | 14.699 | 157.514 |
Buenos Aires | 20682 | 15.808 | 157.092 |
Socorro LINEAR | 97365 | 13.576 | 155.943 |
gastropod mollusk | 19342 | 15.687 | 154.835 |
Hong Kong | 67738 | 13.827 | 153.804 |
Los Angeles | 134801 | 12.883 | 152.172 |
Tel Aviv | 9144 | 16.667 | 152.018 |
Burkina Faso | 5407 | 17.649 | 151.705 |
Kuala Lumpur | 6450 | 17.233 | 151.173 |
Notre Dame | 13546 | 15.873 | 151.021 |
at least good enough to provide data for the next experiment...