<<  intro    index    take 2: term frequency  >>

take 1: trigram frequency

as a warm up we'll analyse the frequencies of trigrams (ie sequences of three terms) in the text.

eg. the document "the cat sat on the mat" has 4 trigrams;
"the cat sat", "cat sat on", "sat on the" and "on the mat"

intuitively it feels like trigrams with a low frequency are candiate sips (most definitely the converse is true). however this naive approach alone won't work too well since, according to zipf's law we expect almost all trigrams to be unique.

we'll also continue to be lazy^H^H^H^Hpragmatic by simplifying in the following ways...

  1. remove non alpha numeric characters
  2. allow the trigrams to span a sentence boundary
  3. collapse each file into a single line to allow trigrams to span what would have been line boundaries

digression: hadoop

hadoop is an opensource implemenation of a map reduce framework very well suited to this type of data stream processing task. it includes a great streaming interface allowing map reduce tasks to be implemented using any language. all you need is a program that will accept records as a line of stdin and emit key/value pairs on stdout

eg trigram frequency generation can be done as a single map task...

#!/usr/bin/env ruby
def emit tuple
 puts "LongValueSum:#{tuple.join(' ')}\t1"
end
 
NGRAM_SIZE = 3
 
STDIN.each do |line|
 terms = line.strip.split
 next if terms.size < NGRAM_SIZE
 
 tuple = []
 NGRAM_SIZE.times { tuple << terms.shift }
 emit tuple
 
 while not terms.empty?
  tuple.shift
  tuple << terms.shift
  emit tuple
 end
 
end

...and run in hadoop as easily as

$HADOOP_HOME/bin/hadoop \
 jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.0-streaming.jar \
 -input input_dir \
 -output trigrams \
 -mapper emit_ngrams.rb \
 -reducer aggregate
(see this note about using an aggregate reducer)

what could be easier?? ( but don't worry, it'll get hard soon enough ) have a look at the code on github

take 1: trigram frequency (continued)

consider a sample corpus of 8 randomly chosen gutenburg texts. (55,000 lines in total)

7hmvg10 Home Vegetable Gardening (1911)
7stir10 The Journal of Arthur Stirling (1903)
8prmt10 Prometheus (1772 german poetry)
8sknn10 The Story Of Kennett (1866)
dwqoh10 Widger's Quotations from the Works of Oliver W. Holmes, Sr.
esper10 A Complete Grammar of Esperanto
fbncc10 The first 1001 Fibonacci Numbers
gm77v10 One of Our Conquerors (1897)

we end up with 340,000 trigrams occurences. not surprisingly they are almost all unique.

the following table says...
there was 1 trigram that appeared 97 times; the phrase 'one of the'
there were 10 trigrams that appeared 30 times; an example is 'even if you'
there were 247,000 trigrams that appeared only once; an example is 'addition during three'

trigram frequency histogram
freq of freq (*) trigram freq (*) examples
197one of the
195i do not
.........
1030even if you, that is the, what do you
.........
4,3003have seen that, must be a, you think it
18,0002such a word, than flowers it, was merely the
247,0001addition during three, had an incorrigible, ne lasu min

conclusions

if we were to deem the low frequency trigrams as sips we'd have nearly nothing but sips!
as expected just considering trigram frequency is not enough.

hmmm, lets try to improve by making use of term frequencies

<<  intro    index    take 2: term frequency  >>

sept 2009