The Common Crawl dataset
Common Crawl is a publicly available 30TB web crawl taken between September 2009 and September 2010. As a small project I decided to extract and tokenise the visible text of the web pages in this dataset. All the code to do this is on GitHub.
1. Getting the data
Only a few things of note about this job...
- The data in S3 is marked as requester pays, so requests need the "x-amz-request-payer" header to be set, even though it's a no-op if you're accessing the data from EC2.
- Pulling from S3 to EC2 is network bound, so I ran the job with the MultithreadedMapRunner to get as much throughput as possible.
- The code includes lots of retry logic but also sets mapred.max.map.failures.percent=100 to allow tasks to fail without killing the entire job (e.g. there was one S3 object with bad ACLs that couldn't be read; no amount of retries would have helped).
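The points above can be sketched outside Hadoop as a threaded fetch loop. This is only an illustration, not the code from the job: the `fetch` callable, the key names, and the retry/backoff settings are all assumptions. The only detail taken from S3 itself is the header name and its value, "requester".

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Header that S3 expects on requests to a requester-pays bucket.
REQUESTER_PAYS_HEADERS = {"x-amz-request-payer": "requester"}


def fetch_with_retries(fetch, key, attempts=3, delay=0.0):
    """Call fetch(key, headers) up to `attempts` times; return None if all fail.

    `fetch` is a hypothetical callable that raises IOError on failure.
    """
    for attempt in range(attempts):
        try:
            return fetch(key, REQUESTER_PAYS_HEADERS)
        except IOError:
            time.sleep(delay * (2 ** attempt))  # simple exponential backoff
    return None  # give up on this object without aborting the whole run


def fetch_all(fetch, keys, threads=8):
    """Fetch keys concurrently; return (dict of successes, list of failed keys)."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        bodies = list(pool.map(lambda k: fetch_with_retries(fetch, k), keys))
    ok = {k: b for k, b in zip(keys, bodies) if b is not None}
    failed = [k for k, b in zip(keys, bodies) if b is None]
    return ok, failed
```

Returning the failed keys instead of raising plays the same role as letting map tasks fail: an unreadable object is recorded and skipped rather than stopping everything.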
2. Filtering text/html
The next step was to filter out everything that didn't have a mime type of 'text/html'. This is pretty straightforward since the ARC file format specifies the mime type in a header. I used the ArcInputFormat from Apache Nutch to actually generate the Hadoop map input records.
Across the 3,000,000,000 objects in the crawl there ended up being 2,000 distinct mime types, 700 of them occurring only once, with about 90% of them being nonsense.
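As a rough illustration of that filter, the sketch below pulls the mime type out of a record header line. It assumes the version 1 ARC layout of five space-separated fields (URL, IP address, archive date, content type, length); the function names are made up for this example and this is not how ArcInputFormat does it.

```python
def arc_mime_type(header_line):
    """Return the lowercased mime type from an ARC v1 record header, or None."""
    fields = header_line.strip().split(" ")
    if len(fields) != 5:
        return None  # malformed header; skip rather than guess
    content_type = fields[3]
    # Drop any parameters such as ";charset=utf-8".
    return content_type.split(";")[0].strip().lower()


def is_html(header_line):
    """True only for records whose header declares text/html."""
    return arc_mime_type(header_line) == "text/html"
```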
The top five mime types were ...
Even though there's probably interesting content in the non text/html object types it seemed that just handling text/html would get me the biggest bang for my buck.
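The mime type tally described above is essentially a word count. A minimal in-memory sketch follows; the function name and the shape of the returned summary are assumptions for illustration, and at crawl scale this would be a Hadoop job rather than a single Counter.

```python
from collections import Counter


def summarise_mime_types(mime_types, top_n=5):
    """Count mime types and report distinct values, singletons, and the top few."""
    counts = Counter(mime_types)
    singletons = sum(1 for c in counts.values() if c == 1)
    return {
        "distinct": len(counts),
        "singletons": singletons,  # types seen exactly once
        "top": counts.most_common(top_n),
    }
```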
Initially I had some problems with encoding. Though HTTP response headers include an encoding field that is meant to indicate what encoding the payload is in, I found it to be wrong about 30% of the time :( I just ignored what the header said and used the CharsetDetector provided in Apache Tika. CharsetDetector inspects a chunk of bytes and uses heuristics to guess the encoding; the payload is then decoded and re-encoded as UTF-8.
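To make the "ignore the header, look at the bytes" idea concrete, here is a much cruder stand-in for a real detector: it simply tries a fixed list of encodings in order and re-encodes the first successful decode as UTF-8. It does not reproduce Tika's heuristics, and the candidate list and function name are assumptions.

```python
def to_utf8(payload, candidates=("utf-8", "windows-1252", "latin-1")):
    """Decode bytes with the first candidate encoding that works.

    Returns (utf8_bytes, encoding_used). The declared HTTP encoding is
    deliberately not consulted; only the bytes themselves are.
    """
    for encoding in candidates:
        try:
            text = payload.decode(encoding)
        except UnicodeDecodeError:
            continue
        return text.encode("utf-8"), encoding
    # Nothing fit: keep what we can and replace undecodable bytes.
    return payload.decode("utf-8", errors="replace").encode("utf-8"), None
```

Order matters here: strict UTF-8 goes first because it rejects most non-UTF-8 input, while single-byte encodings accept almost anything and so only make sense as fallbacks.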
3. Extracting visible text
Next was to extract the visible text from this raw HTML. After playing with a few libraries I ended up going with boilerpipe, in particular its KeepEverythingWithMinKWordsExtractor. Boilerpipe, roughly, returns a single line per block element of the HTML.
4. Filtering for English content
I then used LanguageIdentifier, again a part of Tika, to filter out everything but English text. It didn't seem to have any false positives, but looking at the top five languages something seems amiss...
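The "one line per block element, keep blocks with at least K words" behaviour can be approximated with the standard library. This is a simplified sketch and not boilerpipe's algorithm; the tag sets, class name, and default of three words are assumptions.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries, and tags whose text is never visible.
BLOCK_TAGS = {"p", "div", "li", "td", "tr", "ul", "ol", "table", "br",
              "h1", "h2", "h3", "h4", "h5", "h6", "blockquote", "pre"}
HIDDEN_TAGS = {"script", "style", "noscript", "title"}


class VisibleText(HTMLParser):
    """Collects visible text, starting a new block at each block-level tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._current = []
        self._hidden_depth = 0

    def flush(self):
        # Collapse whitespace and store the block if anything is left.
        text = " ".join("".join(self._current).split())
        if text:
            self.blocks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self._hidden_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS:
            self._hidden_depth = max(0, self._hidden_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._hidden_depth == 0:
            self._current.append(data)


def extract_blocks(html, min_words=3):
    """Return visible text blocks that contain at least `min_words` words."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    parser.flush()
    return [b for b in parser.blocks if len(b.split()) >= min_words]
```

The minimum word count is what drops navigation fragments such as "Home" or "Next page" while keeping real paragraphs.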
I never got around to sampling some of the Lithuanian ones to see what was actually going on but I'm a bit suspicious. I might have actually lost a bit of content in this step...
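For a sense of what a language filter does, here is a deliberately naive stand-in that scores text by the fraction of common English function words. It is not how Tika's LanguageIdentifier works; the word list, threshold, and function names are assumptions chosen only for illustration.

```python
# A tiny, hand-picked list of very common English function words.
ENGLISH_STOPWORDS = frozenset(
    "the of and to a in is that it for was on with as are be this at by".split()
)


def english_score(text):
    """Fraction of words in `text` that are common English function words."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w in ENGLISH_STOPWORDS for w in words) / len(words)


def looks_english(text, threshold=0.2):
    """Keep text whose stopword fraction reaches the threshold."""
    return english_score(text) >= threshold
```

A heuristic like this makes it easy to see how short or unusual pages could be mislabelled, which is the kind of loss suspected above.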
This tokeniser was wrapped in a TokeniseSentences Hadoop job that did some additional sanity checking, such as ignoring one- or two-word sentences.
The final output was 92,000,000,000 sentences (3TB gzipped). Next will be to finish porting my near-duplicate sketching algorithm to Hadoop to run it across this data.
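The sanity check on sentence length might look roughly like this. Whitespace splitting stands in for the real tokeniser, and the function name and the minimum of three tokens are assumptions based on "ignoring one/two word sentences".

```python
def filter_sentences(sentences, min_tokens=3):
    """Yield normalised sentences that have at least `min_tokens` tokens."""
    for sentence in sentences:
        tokens = sentence.split()  # placeholder for the actual tokeniser
        if len(tokens) >= min_tokens:
            yield " ".join(tokens)
```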