Hadoop Boot Camp
by Paul O'Rorke on Mar.06, 2009, under Meeting Notes
Scale Unlimited held its first public “Hadoop Boot Camp” at the Plug and Play Center in Redwood City on March 5th and 6th, 2009. Hadoop is an Apache open source project that includes a bundle of related sub-projects supporting distributed computing using MapReduce. It is becoming a “virtual OS for your data center” for many large distributable problems. Yahoo is a major contributor and uses Hadoop extensively on large clusters. Yahoo and Hadoop won the Terabyte sort benchmark contest in 2008 (the first Java and open-source entrant to win) using 910 nodes with two quad-core Xeons per node. Hadoop has been used on a 2,000-node cluster, and the current design goal is 10,000 nodes.
Scale Unlimited is a new company specializing in Hadoop training, and its principals, Chris Wensel and Stefan Groschupf, serve as friendly “Drill Sergeants.” Their two-day training session includes hands-on labs as well as lectures, and it is a great way to learn a lot about Hadoop Core and related technologies in a short period of time. They strike a nice balance, keeping everything compact and concentrated without making it indigestible, opaque, or overwhelming.
APPLICATIONS
Hadoop is being used in both “on demand” and “always available” configurations. I am mainly interested in using Hadoop for analytics, data mining, and other Business Intelligence applications, so it was interesting to hear that Hadoop is becoming an alternative to database applications, including analytics. Many companies (including at least one large telecom company with researchers at the Boot Camp) are considering or already using Hadoop to replace or supplement Oracle or Teradata based analyses.
ARCHITECTURE
Chris and Stefan gave a nice description of the architecture of Hadoop starting at a very abstract high level and then adding detail progressively through the sessions. One of the things that impressed me is that Hadoop is data center, rack, and node “aware” and it uses separation between machines, racks and geographically separated data centers to improve redundancy and reduce the costs of communicating over slower connections. However, Chris and Stefan do not recommend running Hadoop over multiple data centers for reasons given below.
REDUNDANCY, RELIABILITY & ROBUSTNESS
In general, Hadoop fails over automatically (for example, if a TaskTracker, DataNode, Mapper, or Reducer fails). However, there is a “single point of failure”: the master node running the NameNode and JobTracker. The NameNode maintains the filesystem namespace and receives heartbeats from DataNodes. There is a “Secondary NameNode,” but it is not a backup; it just does some of the work for the “primary” NameNode. The JobTracker manages Jobs (single MapReduce executions) for Hadoop clients. If you have a choice (e.g., if Hadoop is running in a data center that you control), the recommendation is to run these processes on a high-end, more reliable machine, while everything else can run on less reliable commodity hardware. Running Hadoop across separate data centers is not recommended: the master node will sit in one data center, communication with nodes in the other data centers will be slower and less reliable, and there is no redundancy advantage, since the whole job will be toast if the master node goes down or becomes unavailable.
Replicas of data blocks are distributed over nodes in a rack-aware way, making it less likely that all replicas will be lost. “Speculative execution” allows some percentage of a job's tasks to be run multiple times in case some of them fail. It may make sense for huge clusters, but for most problems it is discouraged.
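To make the rack-aware idea concrete, here is a minimal sketch in Python. It is not Hadoop code. It assumes three copies per block and a placement rule of "first copy on the writing node, second copy on a different rack, third copy on another node of that second rack," which is my understanding of the usual default rather than something stated in the class. The function name `place_replicas` and the `topology` dictionary are hypothetical, and the sketch assumes at least two racks with at least two nodes on the remote rack.

```python
import random

def place_replicas(topology, writer_node, rng=None):
    """Pick three nodes for one block so that two racks hold a copy.

    topology: hypothetical dict mapping rack name -> list of node names.
    writer_node: the node where the client writing the block runs.
    """
    rng = rng or random.Random()
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer_node]

    # Copy 1: stay on the writer's own node (no network hop).
    first = writer_node

    # Copy 2: go to a different rack, so losing one rack cannot lose the block.
    remote_rack = rng.choice([rack for rack in topology if rack != local_rack])
    second = rng.choice(topology[remote_rack])

    # Copy 3: a different node on that same remote rack (cheap in-rack transfer).
    third = rng.choice([node for node in topology[remote_rack] if node != second])

    return [first, second, third]

if __name__ == "__main__":
    racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
    print(place_replicas(racks, "n1"))
```

With this rule, the failure of any single node or any single rack still leaves at least one copy of the block readable.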
PERFORMANCE
Stefan pointed out that “the whole idea of Hadoop is to move processing close to the data.” This strategy is also used by many database vendors (e.g., Oracle). In the case of databases, vendors moved things like data mining algorithms into the database. In Hadoop’s case, code is moved to a machine that already holds the data it will process, so that the data is read from local memory or disk instead of being fetched over the network. Hadoop has several components that contribute to the distribution of code and data, including a “distributed cache.”
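The "move processing close to the data" idea can be sketched as a simple preference order when assigning a task: a node that holds the block, then a node on the same rack as the block, then anything that is free. The Python below is only an illustration under that assumption. The function `pick_node` and its arguments are hypothetical names, not part of Hadoop's API, and the real scheduler is more involved.

```python
def pick_node(block_locations, free_nodes, rack_of):
    """Choose where to run a map task that reads one block.

    block_locations: set of nodes holding a copy of the block.
    free_nodes: list of nodes with a free task slot, in arrival order.
    rack_of: dict mapping node name -> rack name.
    Returns (node, locality_label); node is None if nothing is free.
    """
    # Best case: run on a node that already stores the block.
    for node in free_nodes:
        if node in block_locations:
            return node, "data-local"

    # Next best: run on the same rack, so the read stays inside one switch.
    data_racks = {rack_of[node] for node in block_locations}
    for node in free_nodes:
        if rack_of[node] in data_racks:
            return node, "rack-local"

    # Last resort: any free node, paying the cross-rack network cost.
    if free_nodes:
        return free_nodes[0], "off-rack"
    return None, "none"

if __name__ == "__main__":
    racks = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
    print(pick_node({"n1", "n3"}, ["n2", "n3"], racks))
```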
Hadoop reuses objects all the time for higher performance. This reminds me of the way the early Java application server Kiva obtained high performance in part by avoiding the memory and time overhead of creating new objects.
Hadoop’s file system (HDFS) is optimized for reads, not writes: in particular, it is optimized for many concurrent sequential reads of large files, but only one writer is currently supported reliably. You should use large files (e.g., 500MB) and avoid having lots of smaller files.
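One way to see why lots of small files hurt is to count the objects the NameNode has to track, since it keeps an entry for every file and every block. The sketch below is back-of-the-envelope arithmetic, not a measurement. It assumes a 64 MB block size (configurable per cluster) and counts one object per file plus one per block; the function name `namenode_objects` is hypothetical.

```python
import math

BLOCK_SIZE_MB = 64  # assumed block size; clusters can configure this

def namenode_objects(num_files, file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Rough count of namespace objects: one per file plus one per block."""
    blocks_per_file = max(1, math.ceil(file_size_mb / block_size_mb))
    return num_files * (1 + blocks_per_file)

if __name__ == "__main__":
    # The same 500,000 MB of data stored two different ways.
    large = namenode_objects(num_files=1_000, file_size_mb=500)
    small = namenode_objects(num_files=500_000, file_size_mb=1)
    print(large, small)  # 9000 1000000
```

Under these assumptions the same volume of data costs over a hundred times more namespace entries when it is stored as 1 MB files instead of 500 MB files.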
DISTRIBUTED DATABASES
Chris and Stefan pointed out that I/O for Hadoop is usually files but can be databases, for example reading from Oracle and writing to HBase. (HBase is Hadoop’s distributed large database based on Google’s BigTable.) Ian Kallen, one of the participants, has a good short note on distributed databases at Ian’s blog. They also noted that HDFS does not allow appending to a file after it has been written. This is the source of some reliability issues with databases like HBase and Hypertable, which are built on HDFS, but those issues should be resolved soon. In addition, these distributed databases do not support joins. You can write a MapReduce job to do joins or use additional tools.
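As an illustration of doing a join with map-reduce, here is a small in-memory sketch of what is often called a reduce-side join: the map step tags each record with its source and emits it under the join key, and the reduce step pairs up records from the two sources that share a key. This is plain Python simulating the pattern, not code that runs on Hadoop, and the record layout (`user_id`, `name`, `item`) is invented for the example.

```python
from collections import defaultdict

def map_tagged(records, key_field, tag):
    """Map step: emit (join_key, (tag, record)) so the reducer knows the source."""
    for record in records:
        yield record[key_field], (tag, record)

def reduce_join(tagged_values):
    """Reduce step: combine every left record with every right record for one key."""
    left = [rec for tag, rec in tagged_values if tag == "L"]
    right = [rec for tag, rec in tagged_values if tag == "R"]
    return [{**l, **r} for l in left for r in right]

def reduce_side_join(left_records, right_records, key_field):
    """Simulate map, shuffle (group by key), and reduce for an inner join."""
    groups = defaultdict(list)
    for key, value in map_tagged(left_records, key_field, "L"):
        groups[key].append(value)
    for key, value in map_tagged(right_records, key_field, "R"):
        groups[key].append(value)
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_join(groups[key]))
    return joined

if __name__ == "__main__":
    users = [{"user_id": 1, "name": "ann"}, {"user_id": 2, "name": "bob"}]
    orders = [{"user_id": 1, "item": "disk"}, {"user_id": 3, "item": "fan"}]
    print(reduce_side_join(users, orders, "user_id"))
```

Keys that appear on only one side produce no output, so this behaves like an inner join.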
SCALE
Chris mentioned that he generally uses 25-40 node clusters and sometimes uses 100 node EC2 clusters on large problems.
MANAGEMENT & MONITORING
The management and monitoring tools for Hadoop seemed fairly primitive. There are several web pages that come with Hadoop and there is an open source application called Ganglia. But there seems to be a lot of room for improvement in this area. Ian mentioned that Hyperic, a cloud application monitoring company, has been approached by Cloudera and they are talking about coming out with a more sophisticated monitoring tool.
HISTORY
Hadoop was originally a spinoff from Nutch, an open source search engine project. The Google “MapReduce” paper came out in 2004 and Doug Cutting implemented MapReduce as part of Nutch. At some point, Yahoo picked it up, and they currently have 30-plus developers working on Hadoop. Nutch has been left behind and seems moribund: it runs its MapReduce on an older version of Hadoop and is not recommended anymore. Today you can use Hadoop to build a search engine’s crawler and indexer, and Katta (Stefan’s project) can be used to serve indices using Lucene. On the other hand, if you just want something like “Google in a box” then Nutch may be a good choice.
HADOOP INTENSIVE
The course provided a wealth of detail to help attendees get up and running with Hadoop, including information about development and deployment using the single-JVM local version of Hadoop, the localhost cluster, and distributed cluster mode. A series of progressively more advanced labs used exercises based on the simple word-count example from the early Google MapReduce paper. In addition, the class covered Hadoop’s API, shell commands, management scripts, monitoring Hadoop using a browser, startup in “safe mode,” the HDFS file space balancer, HDFS upgrades, and much more. Chris and Stefan shared their experience, providing Hadoop lore that would otherwise have to be learned the hard way.
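For readers who have not seen the word-count example, the shape of it can be sketched in a few lines of plain Python. This is not the lab code from the course and it does not use Hadoop; it only simulates the three phases (map, shuffle, reduce) in a single process, and the function names are my own.

```python
from collections import defaultdict

def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle step: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(word, counts):
    """Reduce step: sum the counts collected for one word."""
    return word, sum(counts)

def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(word, counts) for word, counts in shuffle(pairs).items())

if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog", "The fox"]))
```

On a real cluster the map calls run in parallel on the nodes holding the input blocks, and the shuffle moves each word's pairs to whichever reducer owns that word.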