Web Mining in the Cloud
by Paul O'Rorke on Nov.01, 2009, under Meeting Notes
Ken Krugler, the founder of Bixo Labs, Inc., gave an interesting presentation on elastic web data mining at the 2009 Silicon Valley Data Mining Camp. His session was part of the half-day “unconference” organized by the Bay Area ACM at the Hacker Dojo in Mountain View on Sunday, November 1st, 2009.
Ken explained the steps required in web mining and how web mining differs from other kinds of data mining (for example, the scale is typically much larger and the data is relatively unstructured). He then presented his platform for web data mining, which combines Hadoop, EC2, Cascading, and Bixo (HECB). The Bixo toolkit helps you collect web data, for example by making it easier to crawl the web without hammering web sites and getting blocked. Cascading helps manage the workflow. Amazon’s EC2 and Hadoop provide elastic map-reduce and a reliable distributed file system, which let you scale up to very large computational problems as needed.
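To make the idea of crawling without hammering web sites concrete, here is a minimal sketch of per-host politeness scheduling. It is not Bixo's API (Bixo is a Java toolkit, and the talk's actual code is not reproduced here); the function name `polite_schedule` and the 30-second default delay are assumptions chosen only for illustration.

```python
from urllib.parse import urlparse


def polite_schedule(urls, min_delay=30.0):
    """Assign each URL an earliest fetch time, in seconds from the start.

    Hypothetical helper, not part of Bixo: URLs on the same host are
    spaced at least `min_delay` seconds apart, while URLs on different
    hosts may share a time slot.
    """
    next_slot = {}  # host -> earliest time the next request may be sent
    schedule = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        fetch_at = next_slot.get(host, 0.0)
        schedule.append((fetch_at, url))
        next_slot[host] = fetch_at + min_delay
    # Stable sort: URLs with equal fetch times keep their input order.
    schedule.sort(key=lambda item: item[0])
    return schedule


if __name__ == "__main__":
    urls = [
        "http://example.com/a",
        "http://example.com/b",
        "http://example.org/a",
    ]
    for fetch_at, url in polite_schedule(urls, min_delay=10.0):
        print(f"{fetch_at:6.1f}s  {url}")
```

A real crawler would also honor robots.txt and spread the per-host queues across many machines; this sketch only shows the spacing rule that keeps any single site from being overloaded.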
Ken gave two concrete examples and showed their implementation on the HECB platform. The first was a keyword analysis of the kind that competing companies might run, and the second involved mining email lists. It was impressive to see how much of the processing was generated by Bixo and Cascading and how only a small fraction of the code needed to be written “by hand.”
P.S. As a contribution to the web mining community, Bixo Labs and Concurrent are developing a Public Terabyte Dataset that will be hosted in S3 at Amazon and will be free to EC2 users. See http://bixolabs.com/datasets/public-terabyte-dataset-project/