brain of mat kelcey...

finding names in common crawl

August 18, 2012 at 08:00 PM | categories: Uncategorized

the central offering from common crawl is the raw bytes they've downloaded and, though this is useful for some people, a lot of us just want the visible text of web pages. luckily they've done this extraction as a part of post processing the crawl and it's freely available too!

getting the data

the first thing we need to do is determine which segments of the crawl are valid and ready for use (the crawl is always ongoing)

$ s3cmd get s3://aws-publicdatasets/common-crawl/parse-output/valid_segments.txt
$ head -n3 valid_segments.txt
1341690147253
1341690148298
1341690149519

given these segment ids we can look up the related textData objects.

if you just want one, grab its name using something like ...

$ s3cmd ls s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690147253/ 2>/dev/null \
 | grep textData | head -n1 | awk '{print $4}'
s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690147253/textData-00000

but if you want the lot you can get the listing with ...

$ cat valid_segments.txt \
 | xargs -I{} s3cmd ls s3://aws-publicdatasets/common-crawl/parse-output/segment/{}/ \
 | grep textData | awk '{print $4}' > all_valid_segments.tsv

( note: this listing is roughly 200,000 textData files and takes a while to fetch )

each textData file is a hadoop sequence file, the key being the crawled url and the value being the extracted visible text.

to have a quick look at one you can get hadoop to dump the sequence file contents with ...

$ hadoop fs -text textData-00000 | less
http://webprofessionals.org/intel-to-acquire-mcafee-moving-into-online-security-ny-times/       Web Professionals
Professional association for web designers, developers, marketers, analysts and other web professionals.
Home
...
The company’s share price has fallen about 20 percent in the last five years, closing on Wednesday at $19.59 a share.
Intel, however, has been bulking up its software arsenal. Last year, it bought Wind River for $884 million, giving it a software maker with a presence in the consumer electronics and wireless markets.
With McAfee, Intel will take hold of a company that sells antivirus software to consumers and businesses and a suite of more sophisticated security products and services aimed at corporations.

( note: the visible text is broken into one line per block element from the original html. as such the value in each key/value pair contains line breaks and, for something like less, is output as separate lines )
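because of those line breaks, anything that reads this dump line by line sees one record's text spread over several lines. here's a minimal python sketch of stitching records back together. it assumes the key and value are separated by a tab and that only record keys look like a url followed by a tab at the start of a line; both are assumptions, and `regroup` is just a hypothetical helper name.

```python
import re

# assumption: a record starts on a line shaped like "<url>\t<first line of text>"
RECORD_START = re.compile(r"^(https?://\S+)\t(.*)$")


def regroup(lines):
    """Yield (url, [text lines]) pairs from `hadoop fs -text` style output."""
    url, block = None, []
    for line in lines:
        line = line.rstrip("\n")
        match = RECORD_START.match(line)
        if match:
            # a new key: flush the record collected so far
            if url is not None:
                yield url, block
            url, block = match.group(1), [match.group(2)]
        elif url is not None:
            # a continuation line of the current record's visible text
            block.append(line)
    if url is not None:
        yield url, block
```

a page whose own text happens to contain a line starting with a url and a tab would be split wrongly, so treat this as good enough for eyeballing data rather than anything exact.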

extracting noun phrases

now that we have some text, what can we do with it? one thing is to look for noun phrases, and the quickest, simplest way is to use something like the python natural language toolkit. it's certainly not the fastest to run but for most people it would be the quickest to get going with.

extract_noun_phrases.py is an example of doing sentence then word tokenisation, pos tagging and finally noun chunk phrase extraction.

given the text ...

Last year, Microsoft bought Wind River for $884 million. This makes it the largest software maker with a presence in North Kanada.

it extracts the noun phrases ...

Microsoft
Wind River
North Kanada
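the tokenise, tag, chunk pipeline described above can be sketched roughly as below. this is not the actual extract_noun_phrases.py; the chunk grammar (any run of proper nouns becomes one phrase) and the function names are assumptions for illustration. it also needs nltk's tokeniser and tagger models to have been fetched with nltk.download beforehand.

```python
# assumption: a chunk grammar that treats any run of proper nouns as one phrase;
# the grammar in the real extract_noun_phrases.py may differ
GRAMMAR = "NP: {<NNP.*>+}"


def chunks_to_phrases(tree, label="NP"):
    """Join the words of every subtree that carries the given chunk label."""
    return [
        " ".join(word for word, _tag in subtree.leaves())
        for subtree in tree.subtrees()
        if subtree.label() == label
    ]


def noun_phrases(text):
    """Sentence split, word tokenise, pos tag, then chunk with nltk."""
    import nltk  # imported lazily; tokeniser and tagger models must be installed

    chunker = nltk.RegexpParser(GRAMMAR)
    phrases = []
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        phrases.extend(chunks_to_phrases(chunker.parse(tagged)))
    return phrases
```

what comes back for a given sentence depends on the tagger model, so expect the phrases to vary a little between nltk versions.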

to run this at larger scale we can wrap it in a simple streaming job

hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
 -input textDataFiles \
 -output counts \
 -mapper extract_noun_phrases.py \
 -reducer aggregate \
 -file extract_noun_phrases.py
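one detail worth spelling out: hadoop streaming's built-in aggregate reducer expects each mapper output line to name the aggregation function in front of the key, e.g. `LongValueSum:<key><tab><value>`. here's a small python sketch of emitting lines in that shape, plus a local stand-in for the summing so it can be sanity checked without a cluster. `emit` and `local_aggregate` are hypothetical names, not part of the original script.

```python
from collections import Counter


def emit(phrases):
    """Format one count of 1 per phrase in the shape the aggregate reducer reads."""
    return ["LongValueSum:%s\t1" % phrase for phrase in phrases]


def local_aggregate(lines):
    """Local stand-in for the aggregate reducer's LongValueSum, for testing only."""
    totals = Counter()
    for line in lines:
        spec, value = line.rsplit("\t", 1)
        function, key = spec.split(":", 1)
        if function == "LongValueSum":
            totals[key] += int(value)
    return totals
```

a phrase containing a tab or newline would corrupt the line format, so a real mapper should strip those characters before emitting.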

run across a small 50mb sample of textData files, the top noun phrases extracted are ...

rank	freq	phrase
1	10094	Posted
2	9597	November
3	9553	February
4	8929	Copyright
5	8726	September
6	8709	January
7	8434	April
8	8307	August
9	7963	October
10	7963	December

this is not terribly interesting; mostly what's being extracted here is the boilerplate of the pages. one tough problem when dealing with the visible text of a web page is that being visible doesn't make it relevant to the actual content of the page. here we see 'posted' and 'copyright'; we're just extracting the chrome of the page.

check out the full list of values with freq >= 20 here; there are some more interesting ones a bit later

notes

so it's fun to look at noun phrases but i've actually brushed over some key details here

not filtering on english text first generates a lot of "noise". "G úûv ÝT M", "U ŠDú T" and "Y CKdñˆô" are not terribly interesting english noun phrases.
running this at scale you'd probably want to move away from streaming and start using an in-process java library like the stanford parser
when it comes to actually doing named entity recognition it's a bit more complex. there's a wavii blog post from manish that talks a bit more about it.
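on the first note, here's a very crude sketch of the kind of filter that would drop phrases like "G úûv ÝT M": keep a phrase only if nearly all of its letters are plain ascii. this is an assumed heuristic, not real language detection. the `looks_english` name and the 0.9 threshold are made up for illustration, and it will wrongly reject legitimate names with accented letters.

```python
def looks_english(phrase, min_ratio=0.9):
    """Crude check: are at least min_ratio of the letters plain ascii?"""
    letters = [c for c in phrase if c.isalpha()]
    if not letters:
        # nothing but digits, punctuation or whitespace
        return False
    ascii_letters = sum(1 for c in letters if c.isascii())
    return ascii_letters / len(letters) >= min_ratio
```

filtering whole pages by language before extraction would be cleaner than filtering phrases afterwards, but this at least shows the idea.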