# brain of mat kelcey

## pseudocounts and the good-turing estimation (part1)

April 03, 2011 at 03:04 PM | categories: pseudocounts, statistics

say we are running the bar at a sold-out bad religion concert. the bar serves beer, scotch and water, and we decide to record orders over the night so that we know how much to order for tomorrow's gig...

| drink | # sales |
|---|---|
| beer | 1000 |
| scotch | 300 |
| water | 200 |

using these numbers we can predict a number of things...

what is the chance the next person will order a beer? it's a pretty simple probability; 1000 beers / 1500 total drinks = 0.67 or 67%

what is the chance the next person will order a water? also straightforward; 200 waters / 1500 total drinks = 0.13 or 13%

now say we run the t-shirt stand...
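the bar arithmetic above can be sketched in a few lines of python. this is only an illustration of the relative frequency calculation, not code from the post; the names `orders` and `order_probability` are made up for the example.

```python
from collections import Counter

# order counts taken from the bar example (beer, scotch, water)
orders = Counter({"beer": 1000, "scotch": 300, "water": 200})

def order_probability(counts, drink):
    """relative frequency of `drink` among all recorded orders (hypothetical helper)."""
    total = sum(counts.values())
    return counts[drink] / total

# print the estimate for every drink we have seen
for drink in orders:
    print(f"{drink}: {order_probability(orders, drink):.2f}")
```

one thing worth noticing with this estimate: a drink that was never ordered has a count of 0, so its estimated probability is exactly 0.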

## e12.1 statistical synonyms

January 23, 2010 at 12:54 PM | categories: e12, statistics

i've had an idea brewing in my head for a while now, seeded by a great talk by peter norvig about statistical approaches to finding patterns in data. one thing he alludes to is the generation of synonyms based on n-gram models. the basic intuition is this; if a corpus contains occurrences of the phrases 'a x b' and 'a y b' then to some degree x and y are synonymous. the question becomes how do we calculate the strength of the relationship? how is it a function of the frequencies of a, b, x, y, 'a x b', 'a y b', 'a ?...
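the 'a x b' / 'a y b' intuition can be sketched roughly as below. this is a toy illustration only: the corpus is made up, `shared_context_counts` is a hypothetical helper, and counting shared contexts is just an assumed stand-in for a strength score, not the measure the post goes on to work out.

```python
from collections import defaultdict
from itertools import combinations

def shared_context_counts(tokens):
    """for each pair of words, count how many distinct (a, b) contexts they share."""
    # map each surrounding context (a, b) to the set of middle words seen in 'a ? b'
    middles = defaultdict(set)
    for a, x, b in zip(tokens, tokens[1:], tokens[2:]):
        middles[(a, b)].add(x)

    # every pair of middle words found under one context shares that context
    pair_counts = defaultdict(int)
    for words in middles.values():
        for x, y in combinations(sorted(words), 2):
            pair_counts[(x, y)] += 1
    return dict(pair_counts)

# toy corpus, invented purely for illustration
corpus = "the big dog ran . the large dog ran . a big cat sat . a large cat sat .".split()
print(shared_context_counts(corpus))
```

on this toy corpus 'big' and 'large' share two contexts ('the ? dog' and 'a ? cat'), while 'a' and 'the' share one. raw shared-context counts ignore how frequent each word and phrase is, which is exactly the open question about strength raised above.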

## simple statistics with R

October 03, 2009 at 03:43 PM | categories: statistics, r, language

i'm learning a new statistics language called R and it's pretty cool.

make a vector ...

    > c(3,1,4,1,5,9,2,6,5,3,5,8)
     [1] 3 1 4 1 5 9 2 6 5 3 5 8

turn it into a frequency table ...

    > table(c(3,1,4,1,5,9,2,6,5,3,5,8))
    1 2 3 4 5 6 8 9
    2 1 2 1 3 1 1 1

sort by frequency ...

    > sort(table(c(3,1,4,1,5,9,2,6,5,3,5,8)))
    2 4 6 8 9 1 3 5
    1 1 1 1 1 2 2 3

and plot!

    > barplot(sort(table(c(3,1,4,1,5,9,2,6,5,3,5,8))))

so simple!...

## do a degree via youtube

October 01, 2009 at 08:40 PM | categories: lectures, statistics, stanford, machine learning

i'm amazed by how much great content is on youtube, how could you NOT learn something!?

- 13 x 1hr Statistical Aspects of Data Mining (Stats 202)
- 20 x 1hr Machine Learning...

old projects...

- latent semantic analysis via the singular value decomposition (for dummies)
- semi supervised naive bayes
- statistical synonyms
- round the world tweets
- decomposing social graphs on twitter
- do it yourself statistically improbable phrases
- should i burn it?
- the median of a trillion numbers
- deduping with resemblance metrics
- simple supervised learning / should i read it?
- audioscrobbler experiments
- chaoscope experiment