brain of mat kelcey...
collocations in wikipedia, part 1
October 19, 2011 at 08:00 PM | categories: Uncategorized
introduction
collocations are combinations of terms that occur together more frequently than you'd expect by chance.
they can include
- proper noun phrases like 'Darth Vader'
- stock/colloquial phrases like 'flora and fauna' or 'old as the hills'
- common adjective/noun pairs (notice how 'strong coffee' sounds ok but 'powerful coffee' doesn't?)
let's go through a couple of techniques for finding collocations taken from the exceptional nlp text "foundations of statistical natural language processing" by manning and schutze.
mutual information
the first technique we'll try is mutual information; it's a way of scoring terms based on how often they appear together vs how often they appear separately.
the intuition is that if two (or three) terms appear together a lot, but hardly ever appear without each other, they probably can be treated as a phrase.
a common definition for bigram mutual information is
\( MutualInformation(t_1,t_2) = \log_2 \frac{P(t_1,t_2)}{P(t_1) \cdot P(t_2)} \)
and a definition for trigram mutual information i found in the paper A Corpus-based Approach to Automatic Compound Extraction is
\( MutualInformation(t_1,t_2,t_3) = \log_2 \frac{P(t_1,t_2,t_3)}{P(t_1) \cdot P(t_2) \cdot P(t_3) + P(t_1) \cdot P(t_2,t_3) + P(t_1,t_2) \cdot P(t_3)} \)
given a corpus we can use simple maximum likelihood estimates to calculate these probabilities ...
\( P(t_1,t_2,t_3) = \frac{freq(t_1,t_2,t_3)}{\#trigrams} \) and \( P(t_1,t_2) = \frac{freq(t_1,t_2)}{\#bigrams} \) and \( P(t_1) = \frac{freq(t_1)}{\#unigrams} \)
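in code this is straightforward; here's a minimal python sketch, assuming the frequency tables and totals have already been built (the toy counts below are invented purely for illustration) ...

```python
import math

# toy frequency tables standing in for the real ones built from the corpus;
# these counts are invented purely for illustration
unigram_freq = {'Kuala': 6500, 'Lumpur': 6500}
bigram_freq = {('Kuala', 'Lumpur'): 6441}
trigram_freq = {}
n_unigrams = 1386868488
n_bigrams = 1331695519
n_trigrams = 1276522552

# maximum likelihood estimates: freq(ngram) / #ngrams
def p1(t):
    return unigram_freq.get(t, 0) / n_unigrams

def p2(t1, t2):
    return bigram_freq.get((t1, t2), 0) / n_bigrams

def p3(t1, t2, t3):
    return trigram_freq.get((t1, t2, t3), 0) / n_trigrams

def bigram_mi(t1, t2):
    # log2( P(t1,t2) / P(t1).P(t2) )
    return math.log2(p2(t1, t2) / (p1(t1) * p1(t2)))

def trigram_mi(t1, t2, t3):
    # log2( P(t1,t2,t3) / (P(t1).P(t2).P(t3) + P(t1).P(t2,t3) + P(t1,t2).P(t3)) )
    denom = (p1(t1) * p1(t2) * p1(t3)
             + p1(t1) * p2(t2, t3)
             + p2(t1, t2) * p1(t3))
    return math.log2(p3(t1, t2, t3) / denom)

print(bigram_mi('Kuala', 'Lumpur'))  # ~17.7 with these toy counts
```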
so! we need a corpus! let's ...
- grab a freebase wikipedia dump
- pass it through the stanford nlp parser to extract all the sentences
- build the frequency tables for our ngrams
( see my project on github for the gory details to reproduce )
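the counting logic for the frequency tables is simple; here's a sketch assuming one tokenised sentence per line in a hypothetical 'sentences.txt' (at these volumes in-memory counters won't survive, so the real thing needs something disk-backed or distributed, but the logic is the same) ...

```python
from collections import Counter

def ngrams(tokens, n):
    # all contiguous runs of n tokens in a sentence
    return zip(*(tokens[i:] for i in range(n)))

unigram_freq, bigram_freq, trigram_freq = Counter(), Counter(), Counter()

# assumes one tokenised sentence per line, whitespace separated
with open('sentences.txt') as f:
    for line in f:
        tokens = line.split()
        unigram_freq.update(tokens)
        bigram_freq.update(ngrams(tokens, 2))
        trigram_freq.update(ngrams(tokens, 3))
```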
working with the 2011-10-15 freebase dump we start with 5,700,000 articles. from this the stanford parser extracts 55,000,000 sentences.
from these sentences we can extract some ngrams ...
|          | #total        | #distinct   |
|----------|---------------|-------------|
| unigrams | 1,386,868,488 | 8,295,593   |
| bigrams  | 1,331,695,519 | 99,340,352  |
| trigrams | 1,276,522,552 | 381,541,510 |
the top 5 of each being...
(table of the top 5 unigrams, bigrams and trigrams)
(note: -RRB- is the tokenised right parenthesis)
and, as always, the devil's in the detail when it comes to tokenisation... you always have to make lots of decisions. if we're after word pairs/triples, should we just remove single characters such as '-' or '|'? for this experiment i decided to leave them in since they act as a convenient separator.
overall the freebase data is clean enough for some hacking. i had to remove some stray html markup left over from the original wikimedia parse (so the stanford parser wouldn't implode) but other than that we can get away with ignoring anomalies such as the trigram '| - |' (hoorah for statistical methods!)
bigram mutual information
calculating the mutual information for all the bigrams with a frequency over 5,000 gives the following top ranked ones
| rank | bigram | freq | m_info |
|------|--------|------|--------|
| 1 | Burkina Faso | 5417 | 17.88616 |
| 2 | Rotten Tomatoes | 5695 | 17.50873 |
| 3 | Kuala Lumpur | 6441 | 17.47578 |
| 4 | Tel Aviv | 9106 | 16.90873 |
| 5 | Baton Rouge | 5587 | 16.85029 |
| 6 | Figure Skating | 5518 | 16.44119 |
| 7 | Lok Sabha | 7429 | 16.43407 |
| 8 | Notre Dame | 13516 | 16.11460 |
| 9 | Buenos Aires | 20595 | 16.05346 |
| 10 | gastropod mollusk | 19335 | 15.92581 |
| 11 | Costa Rica | 11014 | 15.85664 |
| 12 | Barack Obama | 9742 | 15.84432 |
| 13 | vice versa | 5205 | 15.66973 |
| 14 | hip hop | 15727 | 15.63575 |
| 15 | Uttar Pradesh | 7833 | 15.63525 |
| 16 | main-belt asteroid | 10551 | 15.62005 |
| 17 | Theological Seminary | 6131 | 15.61613 |
| 18 | Saudi Arabia | 14887 | 15.59454 |
| 19 | sq mi | 8492 | 15.58054 |
| 20 | São Paulo | 13832 | 15.53181 |
these are pretty much all proper nouns and, though they are all great finds, they're not really the adjective/noun phrases i was particularly interested in. i guess it's not too surprising since we've done nothing in terms of POS tagging yet.
see here for the top 1,000
a plot of term frequency vs mutual information score shows, as expected, a huge density of low frequency / low mutual information bigrams. the low frequency / high mutual info ones (in the top left) are the ones in the table above, and the high frequency / low mutual info ones (in the bottom right) correspond to boring language constructs such as "of the", "to the" or ", and".
trigram mutual information
what about trigrams? here are the top 20 with a support of over 1,000
| rank | trigram | freq | m_info |
|------|---------|------|--------|
| 1 | Abdu ` l-Bahá | 1011 | 19.06866 |
| 2 | Dravida Munnetra Kazhagam | 1043 | 18.98674 |
| 3 | Ab urbe condita | 1059 | 18.58179 |
| 4 | Dar es Salaam | 1130 | 18.09764 |
| 5 | Kitts and Nevis | 1095 | 18.02320 |
| 6 | Procter & Gamble | 1255 | 17.96789 |
| 7 | Antigua and Barbuda | 1290 | 17.90375 |
| 8 | agnostic or atheist | 1068 | 17.84620 |
| 9 | Vasco da Gama | 1401 | 17.77709 |
| 10 | Ku Klux Klan | 1944 | 17.77443 |
| 11 | Ways and Means | 1070 | 17.51264 |
| 12 | Croix de Guerre | 1196 | 17.46765 |
| 13 | Jehovah 's Witnesses | 2235 | 17.46177 |
| 14 | SV = Saves | 1980 | 17.24957 |
| 15 | Venue \| Crowd | 1518 | 17.24024 |
| 16 | summa cum laude | 1363 | 17.22880 |
| 17 | Teenage Mutant Ninja | 1003 | 17.17236 |
| 18 | Osama bin Laden | 1734 | 17.16104 |
| 19 | magna cum laude | 1566 | 17.13729 |
| 20 | Names -LRB- US-ACAN | 2813 | 16.91815 |
again mostly proper nouns, apart from some oddities such as "SV = Saves" (which must be from some type of sports glossary, since i also see "Pts = Points" & "SO = Strikeouts" further down the list)
see here for the top 1,000
a plot of frequency vs mutual info is similar to the bigram case.
and the top non-capitalised trigrams are curious...
- agnostic or atheist
- summa cum laude
- magna cum laude
- unmarried opposite-sex partnerships
- flora and fauna
- non-institutionalized group quarters
- mollusk or micromollusk
- italics indicate fastest
- air-breathing land snail
air-breathing land snails #ftw!
bigrams at a distance
a variation of the standard bigram approach is to treat a pair of tokens as a bigram as long as they have no more than 2 tokens between them.
eg 'the cat in the hat', which would usually just have the bigrams ['the cat', 'cat in', 'in the', 'the hat'], is instead defined by the bigrams ['the cat', 'the in', 'the the', 'cat in', 'cat the', 'cat hat', 'in the', 'in hat', 'the hat']
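a minimal sketch of this windowed extraction (the function and parameter names are just illustrative) ...

```python
def bigrams_at_distance(tokens, max_gap=2):
    # pair each token with every token at most max_gap positions further on
    pairs = []
    for i, t1 in enumerate(tokens):
        for t2 in tokens[i + 1:i + 2 + max_gap]:
            pairs.append((t1, t2))
    return pairs

print(bigrams_at_distance('the cat in the hat'.split()))
# [('the', 'cat'), ('the', 'in'), ('the', 'the'), ('cat', 'in'), ('cat', 'the'),
#  ('cat', 'hat'), ('in', 'the'), ('in', 'hat'), ('the', 'hat')]
```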
this results in roughly three times the bigrams (3,154,200,111 instead of 1,331,695,519), so it's quite a bit more processing, but it allows tokens to influence each other at a short distance.
calculating the mutual information for these bigrams gives a slightly different set
| rank | bigram | freq | m_info |
|------|--------|------|--------|
| 1 | expr expr | 20888 | 17.04807 |
| 2 | ifeq ifeq | 6507 | 16.18608 |
| 3 | Burkina Faso | 5473 | 16.14546 |
| 4 | Rotten Tomatoes | 5705 | 15.75572 |
| 5 | Kuala Lumpur | 6457 | 15.72382 |
| 6 | SO Strikeouts | 5788 | 15.56487 |
| 7 | Masovian east-central | 8452 | 15.41213 |
| 8 | Earned SO | 5651 | 15.40456 |
| 9 | Wins Losses | 7984 | 15.23901 |
| 10 | Tel Aviv | 9137 | 15.15810 |
| 11 | Baton Rouge | 5599 | 15.09785 |
| 12 | Dungeons Dragons | 5509 | 14.84334 |
| 13 | Trinidad Tobago | 6241 | 14.77053 |
| 14 | Figure Skating | 5528 | 14.68826 |
| 15 | Lok Sabha | 7435 | 14.67970 |
| 16 | background-color E9E9E9 | 8490 | 14.65283 |
| 17 | Haleakala NEAT | 5328 | 14.51430 |
| 18 | Kitt Spacewatch | 17854 | 14.43547 |
so there's more noise from the original freebase parse, eg (expr, expr) or (background-color, E9E9E9).
it's interesting to see this approach pick up what would otherwise be trigrams with a middle 'and', eg (Dungeons, Dragons) and (Trinidad, Tobago).
summary
so we've found lots of proper nouns! these can be very useful if you're doing feature extraction for a classifier that doesn't like dependent features; a tokenisation of ['Barack Obama', 'went', 'to', 'Kuala Lumpur'] is often better than ['Barack', 'Obama', 'went', 'to', 'Kuala', 'Lumpur']
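eg a tiny greedy merge over a token stream, given a set of bigram collocations like the high scoring pairs above (the two pairs here are just illustrative) ...

```python
# an illustrative set of bigram collocations, eg high scoring pairs from above
collocations = {('Barack', 'Obama'), ('Kuala', 'Lumpur')}

def merge_collocations(tokens):
    # greedily join adjacent token pairs that form a known collocation
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in collocations:
            merged.append(tokens[i] + ' ' + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(merge_collocations(['Barack', 'Obama', 'went', 'to', 'Kuala', 'Lumpur']))
# ['Barack Obama', 'went', 'to', 'Kuala Lumpur']
```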
coming up next, the mean/sd distance method...