The panel on “Analyzing Big Data” at the SDForum Analytics Conference on “The Analytics Revolution” included representatives of two companies that analyze data on a petabyte scale (Joydeep Sen Sarma, Facebook and Joshua Klahr, Yahoo!) and two software companies that stand behind open source infrastructure components that are often used to build analytics platforms (Amr Awadalla, Cloudera/Hadoop and James Phillips, Northscale/Memcached and Membase). The moderator, Owen Thomas of VentureBeat, started off by asking the panelists whether “big data” is a Silicon Valley phenomenon that will soon spread to the Fortune 500 and the rest of the world.
Amr Awadalla defined “big data” as 10 terabytes or more and noted that when Cloudera talks to prospective customers they often have “medium data” (less than 10 terabytes). Amr noted that individual nodes can have 4, 12, or soon even 24TB nodes. So “medium data” problems don’t require a cluster or Hadoop at all just to deal with the size of the data. The smallest Hadoop cluster requires three nodes. Most of Cloudera’s customers have tens to a few hundred node Hadoop clusters at the high end. Facebook and Yahoo! have clusters with thousands of nodes.
Joshua Klahr said that Yahoo! collects tens of terabytes per day of ad data and weblogs telling them what ads are successful and what content is relevant. Since introducing social applications and features, they have seen a dramatic increase in data because they are collecting text generated by users rather than just clicks.
Joydeep Sen Sarma said Facebook has 400 million users and collects over 12 terabytes of compressed data per day. Facebook has over two petabytes of data. Joydeep noted that the fact that things can now be measured that could not be measured previously and the fact that the value of web companies data per byte is much lower than for earlier kinds of companies have driven the collection of larger amounts of data and the shift toward open source platforms and tools.
Relatively few companies have petabyte scale data sets. But the issue is not so much the size of the data. The real issue is complexity. Things like Hadoop are important not just because they enable companies to work with “stupendous” data sets but also (and more importantly) because they enable companies to work with complex datasets including data that has not been organized into a RDBMS or schema, and including text and weblogs.
The consensus of the panel was that “big data” and associated technologies already are spreading and will continue to spread. Silicon Valley companies may be early developers and early adapters of technology for analyzing “Big Data” but it is spreading from web companies to other sectors including banking, games, government, and telecommunications and from the West to the East Coast and overseas.
Owen invited the panelists to comment on the NoSQL movement. Perhaps surprisingly, several panelists identified with NoSQL came out against the term in one way or another.
Amr Awadalla (Cloudera) prefers “NOSQL” (Not Only SQL) instead of “NoSQL” since more than half of all analysts use SQL. NoSQL doesn’t make sense if it argues against having SQL. The main issue for him is “agility”: the ability to make changes quickly and to be flexible with regard to how things are done. Amr breaks this down into two kinds of agility: ”agility of data types” and “agility of language.”
Amr explained his concept of “agility of data types”: When traditional RDBMs are used and rigid schemas are required, it is necessary to go thru a DBA whenever a change in the schema is required (for example, to add a new column) and this can take a long time. Ditto for loading new data into the schema: ETL is required to load the data and this can take too long. NoSQL approaches have the benefit that they allow operating without a schema or they allow for changing schemata easily and quickly.
Amr’s concept of “agility of language” is: Taking a purely SQL-based approach with traditional RDBMs’s is too inflexible. approaches based on Hadoop can go beyond SQL and allow the use of programming languages more powerful than SQL that accomodate the preferences of your developers, (e.g., Java, Python, C, Perl).
James Phillips (Northscale) identified himself as a NoSQL advocate but said it is not about SQL: it’s not about the query language. The issues are really storage, scaling, and performance. The ACID transaction guarantees provided by traditional RDBMSs come with performance and scalability costs and many applications don’t need the guarantees but rather need greater scalability and higher performance.
Joydeep said NoSQL is like a religion and he hates religion. Although Hadoop is often considered to be part of NoSQL systems, his Hive project introduced a simplified version of SQL on top of Hadoop because many analysts prefer to work with SQL. The real advance has not been to eliminate SQL but rather it is the breaking down or deconstruction of previously monolithic systems into separable components and layers. The components and layers include storage (e.g., the filesystem) and processing (e.g., indexing, mapreduce, query processing, and text processing). One can “rack and stack”and build systems out of the components according to ones needs.
Joshua said that as a Product Manager, he uses Excel extensively. And he said that Yahoo finds it easier to find SQL coders than MapReduce programmers. The consensus was that Excel and SQL are here to stay.
Owen asked the panelists how we can avoid having a “data priesthood” and how we can promote the “democratization of data.” Several panelists referred to the panel on “Competing on Analytics at the Highest Level” because the practices of the analytics competitors on that panel addressed this issue. In addition, several other ways to make data usable across the company were mentioned, for example providing tools that make it easier for people with various backgrounds and knowledge and skills to use data. For example, Hadoop is written in Java but programmers more familiar with other languages like Python and SQL programmers can also use it (e.g., using Streaming or Hive). Going forward, we will see more connections from Hadoop, Hive, and Pig to existing BI tools like Microstrategy (e.g., thru ODBC connectors under development by Cloudera and Facebook) and this should further “democratize the data.”
A recording of the panel discussion is available at dyyno.com.