brain of mat kelcey

finding phrases with mutual information

November 15, 2011 | View Comments

finding phrases with mutual information

continuing on with my series of mutual information experiments how might we extend the technique to identity sequences longer than just two terms?

one novel way is to identify the bigrams of interest, replace them with a single token and simply repeat the entire process. (thanks ted for the idea)

example

so say we had the 6 term sentence i went to new york city

it has 5 bigrams; ('i went', 'went to', 'to new', 'new york', 'york city')

running the mutual information algorithm over this might identify new york as a bigram of interest.

we can swap the two terms with a single token (new_york) giving us a new sentence with 5 terms; i went to '(new_york)' city

this new sentence has 4 bigrams ('i went', 'went to', 'to (new_york)', '(new_york) city')

another run of mutual information might now identify the pair (new_york) city so we replace it with the token ((new_york)_city) and just keep repeating.

data

lets run this over a small sample of 300,000 sentences taken from visible text of the freebase wiki dump after it's been tokenised by the stanford parser

(to speed things a little i calculate mutual information and replace the top 10 bigrams in the text before recalculating)

example starting sentences include...

A solid and dependable performer Taylor held the record having played in games for the Phillies at second base t...
A surface may also exhibit both specular and diffuse reflection as is the case for example of glossy paint as us...
A variety of names have since been given to the Wandering Jew including Matathias Buttadeus Paul Marrane and Isa...
A.D.A.M. has control over Eggman 's computer and therefore every robot he owns he can also spread to other compu...
Absolute magnitude magazine cover Though this image is subject to copyright its use is covered by the U.S. fair ...


results

what phrases do we find?

after the first iteration we get the bigrams we've seen before...

Socorro LINEAR
expr    expr
United  States
Los     Angeles
median  income


but after the second iteration we get a mix of single term bigrams and immediately start seeing some new composite bigrams; in this case the trigram 'per square mile'

(expr_expr)     (expr_expr)
capita  income
(t_t)   t
per     (square_mile)
Las     Vegas


unfortunately there's lots of noise too. 'expr expr expr expr' comes from an single sentence, the term 'expr' repeated 450 times, that must have been poorly parsed originally. the 't t t' case is something similar.

by the 16th iteration we get our first 4gram phrase ' U.S. fair use laws'

had     been
U.S.    ((fair_use)_laws)
Rotten  Tomatoes
science fiction
(New_York)      City


and by the 70th iteration we get our first 5gram phrase 'United Nations Security Council Resolution'. jujitsu fans out there will be pleased to see some grappling coming in too!

alas more rubbish as well with the align styling tags leaking in.

(((United_Nations)_Security)_Council)   Resolution
Submission      (rear_(naked_choke))
Asian   (Pacific_Islander)
(UD_(align_left))       ((align_left)_((align_center)_(Win_(align_left))))
lieutenant      colonel


it's only two passes later that we get a big continuation of this one with 'United Nations Security Council Resolution adopted unanimously'

(((((United_Nations)_Security)_Council)_Resolution)_adopted)    unanimously
(United_States) ((align_left)_((align_center)_(Win_(align_left))))
Flying  Corps
TKO     punches


i was a bit suspicous of this one but grabbing the original text we can see how it makes for an interesting construct in the text...

United Nations Security Council Resolution adopted unanimously on August after recalling Resolution the Council ...
United Nations Security Council Resolution adopted unanimously on March after recalling all previous resolutions...
United Nations Security Council Resolution adopted unanimously on February after noting that the Council had bee...
United Nations Security Council Resolution adopted unanimously on December after reaffirming all resolutions on ...
United Nations Security Council Resolution adopted unanimously on May after a complaint by Senegal against Portu...
United Nations Security Council Resolution adopted unanimously on June after recalling resolutions and the Counc...
United Nations Security Council Resolution adopted unanimously on July after noting the recent entry into force ...
United Nations Security Council Resolution adopted unanimously on May after recalling all resolutions on the sit...
United Nations Security Council Resolution adopted unanimously on January after recalling all previous resolutio...
United Nations Security Council Resolution adopted unanimously on June after hearing representations from Botswa...
United Nations Security Council Resolution adopted unanimously on May after reaffirming Resolution and all subse...
United Nations Security Council Resolution adopted unanimously on August after reaffirming previous resolutions ...
United Nations Security Council Resolution adopted unanimously on December after reaffirming all resolutions on ...
United Nations Security Council Resolution was adopted unanimously on October after recalling resolutions and on...
United Nations Security Council resolution adopted unanimously on March after reaffirming resolutions and on the...
United Nations Security Council Resolution adopted unanimously on June after recalling all previous resolutions ...
United Nations Security Council Resolution adopted unanimously on January after reaffirming Resolution on the si...
United Nations Security Council Resolution adopted unanimously on February after reaffirming resolutions and in ...


interesting. i wonder has this come from a template perhaps? maybe just cut n paste? one author with fixed style?

even by the end of my run, 950 iterations, (aka last night) there continue to be valid short phrases being picked up

(County_Kansas) (United_States)
County  Clare
Sunday  night
Rift    Valley
Charlton        Heston


how has the corpus changed?

during the processing we've been replacing these tokens in the original text. so how does it look by this time? well, not a whole lot has changed actually.

the following 3 random examples show how little the text differs (should have left it running much longer!!)

(He_played) for a (short_time) with (Duke_Ellington) for (which_he) is (best_remembered)

His (debut_single) Mi God Mi King topped the Jamaican (singles_chart) and a string of hits
followed including Heel And Toe Monkey And Ape (Ghost_Rider) and Crucifixion although his
best-remembered song is Mini Bus which lamented the demise (of_the) jolly bus and which
(was_awarded) the title Song Of The Year in (from_the) Jamaica (Broadcasting_Corporation)

However this number is certainly an improvement (from_the) cars it averaged yearly ((during_the)_1980s)


what are the longest phrases identified?

the top three are noise alas...

 rank numunderscores phrase 1 127 expr_expr_expr_..... (128 times) 2 95 September_Socorro_LINEAR_September_Socorro_LINEAR_... (32 times) 3 63 t_t_t_... (64 times)

it's not until the 77th we get something that isn't (arguably) just a repeated pattern or noisy parsing

 rank numunderscores phrase 77 10 At_the_census_there_were_people_households_and_families_residing_in

which were identified as phrases due to the large frequency of occurances of variations of the following...

 freq original phrase 72 As of the census of there were people households and families residing in the city 60 As of the census of there were people households and families residing in the town 46 As of the census of there were people households and families residing in the CDP 32 As of the census of there were people households and families residing in the village 26 As of the census of there were people households and families residing in the township 15 As of the census of there were people households and families residing in the borough 3 At the census there were people households and families residing in the city 2 At the census there were people households and families residing in the village 2 As of the census of there were people households and families residing on the base

fascinating stuff!!

random todo thoughts

• use patterns found in this experiment to clean up noise and rerun
• work out a way to fold the composition into the scoring
• work on larger dataset
• approach to dealing with duplicates? don't want to just uniq since they represent something