brain of mat kelcey...
finding phrases with mutual information
November 15, 2011 at 11:00 PM | categories: Uncategorizedfinding phrases with mutual information
continuing on with my series of mutual information experiments how might we extend the technique to identity sequences longer than just two terms?
one novel way is to identify the bigrams of interest, replace them with a single token and simply repeat the entire process. (thanks ted for the idea)
example
so say we had the 6 term sentence i went to new york city
it has 5 bigrams; ('i went', 'went to', 'to new', 'new york', 'york city')
running the mutual information algorithm over this might identify new york as a bigram of interest.
we can swap the two terms with a single token (new_york) giving us a new sentence with 5 terms; i went to '(new_york)' city
this new sentence has 4 bigrams ('i went', 'went to', 'to (new_york)', '(new_york) city')
another run of mutual information might now identify the pair (new_york) city so we replace it with the token ((new_york)_city) and just keep repeating.
data
lets run this over a small sample of 300,000 sentences taken from visible text of the freebase wiki dump after it's been tokenised by the stanford parser
(to speed things a little i calculate mutual information and replace the top 10 bigrams in the text before recalculating)
example starting sentences include...
A solid and dependable performer Taylor held the record having played in games for the Phillies at second base t... A surface may also exhibit both specular and diffuse reflection as is the case for example of glossy paint as us... A variety of names have since been given to the Wandering Jew including Matathias Buttadeus Paul Marrane and Isa... A.D.A.M. has control over Eggman 's computer and therefore every robot he owns he can also spread to other compu... Absolute magnitude magazine cover Though this image is subject to copyright its use is covered by the U.S. fair ...
results
what phrases do we find?
after the first iteration we get the bigrams we've seen before...
Socorro LINEAR expr expr United States Los Angeles median income
but after the second iteration we get a mix of single term bigrams and immediately start seeing some new composite bigrams; in this case the trigram 'per square mile'
(expr_expr) (expr_expr) capita income (t_t) t per (square_mile) Las Vegas
unfortunately there's lots of noise too. 'expr expr expr expr' comes from an single sentence, the term 'expr' repeated 450 times, that must have been poorly parsed originally. the 't t t' case is something similar.
by the 16th iteration we get our first 4gram phrase ' U.S. fair use laws'
had been U.S. ((fair_use)_laws) Rotten Tomatoes science fiction (New_York) City
and by the 70th iteration we get our first 5gram phrase 'United Nations Security Council Resolution'. jujitsu fans out there will be pleased to see some grappling coming in too!
alas more rubbish as well with the align styling tags leaking in.
(((United_Nations)_Security)_Council) Resolution Submission (rear_(naked_choke)) Asian (Pacific_Islander) (UD_(align_left)) ((align_left)_((align_center)_(Win_(align_left)))) lieutenant colonel
it's only two passes later that we get a big continuation of this one with 'United Nations Security Council Resolution adopted unanimously'
(((((United_Nations)_Security)_Council)_Resolution)_adopted) unanimously (United_States) ((align_left)_((align_center)_(Win_(align_left)))) Flying Corps Saddam Hussein TKO punches
i was a bit suspicous of this one but grabbing the original text we can see how it makes for an interesting construct in the text...
United Nations Security Council Resolution adopted unanimously on August after recalling Resolution the Council ... United Nations Security Council Resolution adopted unanimously on March after recalling all previous resolutions... United Nations Security Council Resolution adopted unanimously on February after noting that the Council had bee... United Nations Security Council Resolution adopted unanimously on December after reaffirming all resolutions on ... United Nations Security Council Resolution adopted unanimously on May after a complaint by Senegal against Portu... United Nations Security Council Resolution adopted unanimously on June after recalling resolutions and the Counc... United Nations Security Council Resolution adopted unanimously on July after noting the recent entry into force ... United Nations Security Council Resolution adopted unanimously on May after recalling all resolutions on the sit... United Nations Security Council Resolution adopted unanimously on January after recalling all previous resolutio... United Nations Security Council Resolution adopted unanimously on June after hearing representations from Botswa... United Nations Security Council Resolution adopted unanimously on May after reaffirming Resolution and all subse... United Nations Security Council Resolution adopted unanimously on August after reaffirming previous resolutions ... United Nations Security Council Resolution adopted unanimously on December after reaffirming all resolutions on ... United Nations Security Council Resolution was adopted unanimously on October after recalling resolutions and on... United Nations Security Council resolution adopted unanimously on March after reaffirming resolutions and on the... United Nations Security Council Resolution adopted unanimously on June after recalling all previous resolutions ... United Nations Security Council Resolution adopted unanimously on January after reaffirming Resolution on the si... United Nations Security Council Resolution adopted unanimously on February after reaffirming resolutions and in ...
interesting. i wonder has this come from a template perhaps? maybe just cut n paste? one author with fixed style?
even by the end of my run, 950 iterations, (aka last night) there continue to be valid short phrases being picked up
(County_Kansas) (United_States) County Clare Sunday night Rift Valley Charlton Heston
how has the corpus changed?
during the processing we've been replacing these tokens in the original text. so how does it look by this time? well, not a whole lot has changed actually.
the following 3 random examples show how little the text differs (should have left it running much longer!!)
(He_played) for a (short_time) with (Duke_Ellington) for (which_he) is (best_remembered) His (debut_single) Mi God Mi King topped the Jamaican (singles_chart) and a string of hits followed including Heel And Toe Monkey And Ape (Ghost_Rider) and Crucifixion although his best-remembered song is Mini Bus which lamented the demise (of_the) jolly bus and which (was_awarded) the title Song Of The Year in (from_the) Jamaica (Broadcasting_Corporation) However this number is certainly an improvement (from_the) cars it averaged yearly ((during_the)_1980s)
what are the longest phrases identified?
the top three are noise alas...
rank | numunderscores | phrase |
1 | 127 | expr_expr_expr_..... (128 times) |
2 | 95 | September_Socorro_LINEAR_September_Socorro_LINEAR_... (32 times) |
3 | 63 | t_t_t_... (64 times) |
it's not until the 77th we get something that isn't (arguably) just a repeated pattern or noisy parsing
rank | numunderscores | phrase |
77 | 10 | At_the_census_there_were_people_households_and_families_residing_in |
which were identified as phrases due to the large frequency of occurances of variations of the following...
freq | original phrase |
72 | As of the census of there were people households and families residing in the city |
60 | As of the census of there were people households and families residing in the town |
46 | As of the census of there were people households and families residing in the CDP |
32 | As of the census of there were people households and families residing in the village |
26 | As of the census of there were people households and families residing in the township |
15 | As of the census of there were people households and families residing in the borough |
3 | At the census there were people households and families residing in the city |
2 | At the census there were people households and families residing in the village |
2 | As of the census of there were people households and families residing on the base |
fascinating stuff!!
random todo thoughts
- use patterns found in this experiment to clean up noise and rerun
- work out a way to fold the composition into the scoring
- work on larger dataset
- approach to dealing with duplicates? don't want to just uniq since they represent something