brain of mat kelcey...

e12.1 statistical synonyms

January 23, 2010 at 12:54 PM | categories: Uncategorized

i've had an idea brewing in my head for a while now, seeded by a great talk by peter norvig about statistical approaches to finding patterns in data.

one thing he alludes to is the generation of synonyms based on n-gram models.

the basic intuition is this: if a corpus contains occurrences of the phrases 'a x b' and 'a y b', then to some degree x and y are synonymous.
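to make the intuition concrete, here's a minimal sketch in python that walks the trigrams of a tokenised corpus and collects every pair of words seen between the same 'a' and 'b'. the function name `shared_context_pairs` and the toy sentence are made up for illustration; nothing here comes from norvig's talk.

```python
from collections import defaultdict
from itertools import combinations

def shared_context_pairs(tokens):
    """map each candidate pair (x, y) to the set of (a, b) contexts they share."""
    # for each surrounding context (a, b), collect the middle words seen in 'a ? b'
    middles = defaultdict(set)
    for a, x, b in zip(tokens, tokens[1:], tokens[2:]):
        middles[(a, b)].add(x)

    # any two distinct middle words that share a context are a candidate pair
    pairs = defaultdict(set)
    for context, words in middles.items():
        for x, y in combinations(sorted(words), 2):
            pairs[(x, y)].add(context)
    return dict(pairs)

# toy corpus (hypothetical); a real one would be far larger
tokens = "the big dog ran . the large dog ran .".split()
candidates = shared_context_pairs(tokens)
print(candidates)
```

on this toy input the only candidate pair is 'big' / 'large', which share the context 'the ? dog'. this only says a pair exists; it says nothing yet about how strong the relationship is.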

the question becomes: how do we calculate the strength of the relationship? how is it a function of the frequencies of a, b, x, y, 'a x b', 'a y b' and 'a ? b' in the corpus? what else can we take into account?
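as one possible starting point (an assumption on my part, not an answer to the question), here's a sketch that scores x and y by a weighted jaccard overlap of their 'a ? b' context counts: sum the smaller count per context and divide by the sum of the larger. it ignores the standalone frequencies of a, b, x and y, so it's a baseline at best. the names `context_counts` and `similarity` are hypothetical.

```python
from collections import Counter, defaultdict

def context_counts(tokens):
    """for each middle word x, count how often it appears in each context (a, b)."""
    counts = defaultdict(Counter)
    for a, x, b in zip(tokens, tokens[1:], tokens[2:]):
        counts[x][(a, b)] += 1
    return counts

def similarity(counts, x, y):
    """weighted jaccard overlap of the contexts of x and y, in [0, 1]."""
    cx, cy = counts[x], counts[y]
    contexts = set(cx) | set(cy)
    # a Counter returns 0 for a context it has never seen
    overlap = sum(min(cx[c], cy[c]) for c in contexts)
    total = sum(max(cx[c], cy[c]) for c in contexts)
    return overlap / total if total else 0.0

# toy corpus (hypothetical)
tokens = "the big dog ran . the large dog ran . a big cat sat .".split()
counts = context_counts(tokens)
print(similarity(counts, "big", "large"))  # 0.5
```

here 'big' appears in two contexts ('the ? dog' and 'a ? cat') and 'large' in one of them, so the score is 1 / 2. words with no shared context score 0.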