Apache’s Mahout Project
by Paul O'Rorke on Apr.21, 2009, under Meeting Notes
Jeff Eastman gave a presentation on Mahout at the SDForum Business Intelligence Special Interest Group’s meeting on April 21st, 2009. Mahout is a collection of machine learning algorithms adapted for use on very large data sets using the Hadoop map-reduce platform. Jeff’s presentation “BI Over Petabytes: Meet Apache Mahout” gave a good introduction to Mahout and a snapshot of the current status. His slides are available here and in the SDForum Archives.
Jeff started out by talking about the relevance of Machine Learning to Business Intelligence and gave examples of methods such as clustering and collaborative filtering. He listed some important applications and highlighted Amazon’s “customers who bought this item also bought…” and Google’s list of “Top Stories” in the news.
Mahout is motivated by the need for an open source machine learning platform that can scale to process large sets of data available on the web. The current code base includes the Taste collaborative filtering implementation, the Watchmaker evolutionary algorithm, several naive Bayesian classification programs and a variety of clustering methods including canopy, k-means, mean shift, and Dirichlet process clustering.
The project is still at an early stage: it came out of the Apache Lucene (search engine) project in January 2008, and version 0.1 was released this month (April 2009). It has not yet been applied to the kind of large data sets that motivated its development. Because Mahout is an open source project, two impediments have stood in the way: access to large data sets and the cost of large runs on cloud computing utilities.
Jeff drilled down into the clustering methods and showed how they work on sample data produced by a Dirichlet process model based on astronomy. The two-dimensional example showed how the different clustering methods operate on points like those one might observe when looking through the Milky Way at globular clusters of stars outside our galaxy. Jeff also showed how Hadoop is used in the Dirichlet process clustering implementation by exhibiting the Java code for the mapper and the reducer. Although the examples were slightly simplified for presentation, it is still impressive how little extraneous overhead is required to integrate the basic clustering algorithm with Hadoop.
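To give a feel for how a clustering algorithm splits into a mapper and a reducer, here is a minimal sketch of my own. It is not the code from Jeff's slides and not Mahout's Java implementation: it uses plain Python with no Hadoop dependency, and it shows k-means rather than Dirichlet process clustering because k-means is the simplest case. The function names (map_point, reduce_cluster, kmeans_iteration, kmeans) are hypothetical, and the grouping step that Hadoop would perform between the two phases is simulated with a dictionary.

```python
from collections import defaultdict
from math import dist


def map_point(point, centers):
    """Mapper: emit (index of the nearest center, point)."""
    nearest = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return nearest, point


def reduce_cluster(points):
    """Reducer: average all points that were assigned to one center."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(points, centers):
    """One k-means pass expressed as map, group-by-key, reduce."""
    # Stand-in for the shuffle phase: group mapper output by key.
    groups = defaultdict(list)
    for p in points:
        key, value = map_point(p, centers)
        groups[key].append(value)
    # A center that attracted no points keeps its old position.
    return [
        reduce_cluster(groups[i]) if i in groups else centers[i]
        for i in range(len(centers))
    ]


def kmeans(points, centers, max_iterations=20):
    """Repeat the map/reduce pass until the centers stop moving."""
    for _ in range(max_iterations):
        new_centers = kmeans_iteration(points, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers
```

Calling kmeans with a list of 2-D points and a list of starting centers returns the final centers. On a real cluster each pass would be a separate Hadoop job, with the current centers distributed to every mapper; that plumbing is omitted here.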
I was pleasantly surprised by how much interest there is in Mahout, since I had feared machine learning algorithms might be seen as an academic or arcane topic. Attendance at this meeting broke the BI SIG’s previous record, and we had to bring in extra chairs to seat everyone. I’m sure Mahout will be used to solve some interesting real-world problems soon, and the people who attended Jeff’s talk will be among the early adopters.