after quite a bit of hacking, the statistical synonyms idea doesn't seem to give terribly interesting results, so i'm going on to do something else.
for the record, here's what i did do though...
- generate 3-grams from 800e3 tweets
- collect 3-grams together that share the same first and last term; eg 'the blue cat', 'the green cat', 'the red cat'
- for each set generate all the combos of the middle terms; eg 'blue green', 'blue red', 'green red'
- count the occurrences of each pair
- draw a graph of the 150 top occurring pairs
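
for anyone who wants to play along, here's a minimal sketch of the counting steps above in python. it's a reconstruction from the list, not the actual code: the function names (`trigrams`, `middle_term_pairs`), the lowercase whitespace tokenising, and the choice to count each pair once per distinct first/last context are all my assumptions for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def trigrams(tokens):
    """Yield consecutive (first, middle, last) triples from a token list."""
    return zip(tokens, tokens[1:], tokens[2:])


def middle_term_pairs(tweets):
    """Count how often two terms appear as the middle of 3-grams
    that share the same first and last term.

    Assumptions (not from the original code): tokens are lowercased
    and split on whitespace, and a pair is counted once for each
    distinct (first, last) context it appears in.
    """
    # (first, last) -> set of middle terms seen in that context
    middles = defaultdict(set)
    for tweet in tweets:
        tokens = tweet.lower().split()
        for first, middle, last in trigrams(tokens):
            middles[(first, last)].add(middle)

    # every unordered pair of middle terms in a context gets one count;
    # sorting makes ('blue', 'green') and ('green', 'blue') the same key
    pair_counts = Counter()
    for terms in middles.values():
        for pair in combinations(sorted(terms), 2):
            pair_counts[pair] += 1
    return pair_counts
```

calling `middle_term_pairs(tweets).most_common(150)` would then give the top 150 pairs to feed into whatever graphing tool you like.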
some interesting results. few of the more complex things i tried worked. they were mainly based on trying to incorporate the frequencies of terms, but it seemed the simplest approach gave the best result (i think it's because my assumptions about how to use the data were wrong).
here's the code. feel free to read my notes, correct my terrible statistical assumptions and make a better image!