brain of mat kelcey...
e12.1 statistical synonyms
January 23, 2010 at 12:54 PM | categories: Uncategorized
i've had an idea brewing in my head for a while now, seeded by a great talk by peter norvig about statistical approaches to finding patterns in data.
one thing he alludes to is the generation of synonyms based on n-gram models.
the basic intuition is this: if a corpus contains occurrences of the phrases 'a x b' and 'a y b' then, to some degree, x and y are synonymous.
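to make the intuition concrete, here's a minimal sketch in python. it slides a trigram window over a tokenised corpus, groups the middle words by their surrounding (a, b) context, and reports every pair of words that share at least one context. the function name `shared_context_pairs` and the toy corpus are made up for illustration; they're assumptions, not anything from norvig's talk.

```python
from collections import defaultdict
from itertools import combinations


def shared_context_pairs(tokens):
    """Map each (x, y) word pair to the set of (a, b) contexts both occur in."""
    # context (a, b) -> set of words seen between a and b
    middles = defaultdict(set)
    for a, x, b in zip(tokens, tokens[1:], tokens[2:]):
        middles[(a, b)].add(x)

    # any two distinct words sharing a context are candidate synonyms
    pairs = defaultdict(set)
    for context, words in middles.items():
        for x, y in combinations(sorted(words), 2):
            pairs[(x, y)].add(context)
    return dict(pairs)


# hypothetical toy corpus, whitespace tokenised
toy_corpus = "the big dog ran . the large dog sat . a big cat ran".split()
candidates = shared_context_pairs(toy_corpus)
for pair, contexts in sorted(candidates.items()):
    print(pair, sorted(contexts))
```

on this toy corpus 'big' and 'large' pair up via 'the ? dog', but so do 'cat' and 'dog' via 'big ? ran', which shows why a shared context alone is only weak evidence.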
the question becomes: how do we calculate the strength of the relationship? how is it a function of the frequencies of a, b, x, y, 'a x b', 'a y b' and 'a ? b' in the corpus? what else can we take into account?
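as a starting point only, here's one naive candidate for such a strength function, sketched in python. it counts how often x and y appear in the same (a, b) context, taking the smaller of the two counts per context, and divides by the total trigram count of the rarer word. the name `overlap_score` and the normalisation are my own assumptions; it deliberately ignores the frequencies of a, b and 'a ? b', so it's a baseline to improve on rather than an answer.

```python
from collections import Counter


def overlap_score(tokens, x, y):
    """Naive synonym strength in [0, 1] based on shared trigram contexts."""
    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))

    # how often each (a, b) context surrounds x, and how often it surrounds y
    ctx_x, ctx_y = Counter(), Counter()
    for (a, mid, b), n in trigram_counts.items():
        if mid == x:
            ctx_x[(a, b)] += n
        if mid == y:
            ctx_y[(a, b)] += n

    # Counter & Counter keeps the minimum count for each shared context
    shared = sum((ctx_x & ctx_y).values())
    smaller = min(sum(ctx_x.values()), sum(ctx_y.values()))
    return shared / smaller if smaller else 0.0


# hypothetical toy corpus, whitespace tokenised
tokens = "the big dog ran . the large dog sat . a big cat ran".split()
print(overlap_score(tokens, "big", "large"))
print(overlap_score(tokens, "big", "dog"))
```

normalising by the rarer word means a rare word that only ever appears in contexts shared with a common one scores highly, which may or may not be what we want.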