<< intro index take 2: term frequency >>
as a warm up we'll analyse the frequencies of trigrams (ie sequences of three terms) in the text.
eg. the document "the cat sat on the mat" has 4 trigrams; "the cat sat", "cat sat on", "sat on the" and "on the mat"
intuitively it feels like trigrams with a low frequency are candiate sips (most definitely the converse is true). however this naive approach alone won't work too well since, according to zipf's law we expect almost all trigrams to be unique.
we'll also continue to be lazy^H^H^H^Hpragmatic by simplifying in the following ways...
hadoop is an opensource implemenation of a map reduce framework very well suited to this type of data stream processing task. it includes a great streaming interface allowing map reduce tasks to be implemented using any language. all you need is a program that will accept records as a line of stdin and emit key/value pairs on stdout
eg trigram frequency generation can be done as a single map task...
#!/usr/bin/env ruby def emit tuple puts "LongValueSum:#{tuple.join(' ')}\t1" end NGRAM_SIZE = 3 STDIN.each do |line| terms = line.strip.split next if terms.size < NGRAM_SIZE tuple = [] NGRAM_SIZE.times { tuple << terms.shift } emit tuple while not terms.empty? tuple.shift tuple << terms.shift emit tuple end end
...and run in hadoop as easily as
$HADOOP_HOME/bin/hadoop \ jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.0-streaming.jar \ -input input_dir \ -output trigrams \ -mapper emit_ngrams.rb \ -reducer aggregate(see this note about using an aggregate reducer)
what could be easier?? ( but don't worry, it'll get hard soon enough ) have a look at the code on github
consider a sample corpus of 8 randomly chosen gutenburg texts. (55,000 lines in total)
7hmvg10 Home Vegetable Gardening (1911) 7stir10 The Journal of Arthur Stirling (1903) 8prmt10 Prometheus (1772 german poetry) 8sknn10 The Story Of Kennett (1866) dwqoh10 Widger's Quotations from the Works of Oliver W. Holmes, Sr. esper10 A Complete Grammar of Esperanto fbncc10 The first 1001 Fibonacci Numbers gm77v10 One of Our Conquerors (1897)
we end up with 340,000 trigrams occurences. not surprisingly they are almost all unique.
the following table says... there was 1 trigram that appeared 97 times; the phrase 'one of the' there were 10 trigrams that appeared 30 times; an example is 'even if you' there were 247,000 trigrams that appeared only once; an example is 'addition during three'
trigram frequency histogram | |||
freq of freq (*) | trigram freq (*) | examples | |
1 | 97 | one of the | |
1 | 95 | i do not | |
... | ... | ... | |
10 | 30 | even if you, that is the, what do you | |
... | ... | ... | |
4,300 | 3 | have seen that, must be a, you think it | |
18,000 | 2 | such a word, than flowers it, was merely the | |
247,000 | 1 | addition during three, had an incorrigible, ne lasu min |
if we were to deem the low frequency trigrams as sips we'd have nearly nothing but sips! as expected just considering trigram frequency is not enough.
hmmm, lets try to improve by making use of term frequencies
<< intro index take 2: term frequency >>
sept 2009