Four Case Studies in R
by Paul O'Rorke on Feb.18, 2009, under Meeting Notes
Michael Driscoll and Jim Porzak organized an excellent panel on “Case Studies in R” at Predictive Analytics World at the Hotel Nikko on Mason Street in San Francisco, February 18th, 2009. Actually, they couldn’t resist having fun yet again with the name “R” and the actual title was “The R and Science of Predictive Analytics: Four Case Studies in R.” Jim and Mike organize the Bay Area useR Group and this meeting was their 2009 kickoff. Driscoll is a Principal in a Business Analytics startup called Dataspora. Michael chaired the session and Jim, now at The Generations Network, gave a quick overview of R and served as one of the four panelists. The other three panelists were: Bo Cowgill from Google, Itamar Rosenn from Facebook, and David Smith from Revolution Computing.
Bo Cowgill started the case studies by sketching out how Google uses R. (Bo is in Hal Varian’s group at Google.) R is not the only statistical software used at Google: SAS, SPSS, and Stata are also used, but R is the most widely used. There are 200 R users at Google, and the company is a corporate supporter of R.
The positive aspects of R and the major reasons R is used are:
- it does what’s needed
- it’s extensible
- it’s fully featured
- it’s a programming language
- it’s free so there is no need to have a budget for it or get approval to purchase it
The negative aspects of R are:
- it doesn’t handle huge data sets easily or well
- it is sometimes slow, for example when computing clustered standard errors
- the learning curve is steep, especially for people with no background in programming, for example Excel users
Bo joked that a positive aspect of R is that it was designed by statisticians. He paused and then said “A negative aspect of R is that it was designed by statisticians.” The algorithms could be more efficient and the language could be more elegant.
Google doesn’t use R “in production” or on their huge clusters. They use it on personal workstations. They build classifiers, do exploratory data analysis, and tune models. After developing models, they typically write programs in more robust languages like C++ or Python to actually apply the models to large data sets. But the features in the models used in production will have come from R.
Google is starting to improve their R development practices by checking code in to source repositories and conducting code reviews. They are also advancing the state of the R by doing research that results in new algorithms, papers, and patent applications.
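The hand-off described above, where a model is explored interactively and then applied by a separate program, might look roughly like the following. This is only an illustrative sketch, not Google's code: the coefficient values and the feature names `clicks` and `is_mobile` are made up, and the model is assumed to be a plain logistic regression whose coefficients were exported from the exploratory fit.

```python
import math

# Hypothetical coefficients, as if exported from a model fitted interactively.
# The feature names and values are invented for illustration.
COEFFICIENTS = {"intercept": -1.0, "clicks": 0.5, "is_mobile": 1.0}


def score(record):
    """Return the logistic-model probability for one record (a dict)."""
    z = COEFFICIENTS["intercept"]
    for name, weight in COEFFICIENTS.items():
        if name != "intercept":
            z += weight * record.get(name, 0.0)
    return 1.0 / (1.0 + math.exp(-z))


def score_stream(records):
    """Score records lazily so the full data set never has to sit in memory."""
    for record in records:
        yield score(record)


probabilities = list(score_stream([
    {"clicks": 2, "is_mobile": 0},  # z = -1 + 1 + 0 = 0
    {"clicks": 0, "is_mobile": 1},  # z = -1 + 0 + 1 = 0
    {"clicks": 4, "is_mobile": 1},  # z = -1 + 2 + 1 = 2
]))
```

Because `score_stream` is a generator, the same function could be fed from a file or a job's input split instead of a list.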
Facebook’s Itamar Rosenn provided the second case study. Facebook has 150 million users, each user is described by hundreds of dimensions, and users take billions of actions on the site every day. All of their analyses and evaluations of experiments and prototypes are done using Hadoop MapReduce, with code written in C++ or Python. R is used for exploratory data analysis in the early stages, on fewer than a million observations. They use the graphics package in R extensively.
About a year ago (2007-2008) Facebook started to lose new users at a faster rate, so there was an urgent need to know which features would predict whether users would stay and what level of activity they would have. They used age, sex, school, number of friends, whether someone communicated with them, the number of photo uploads, and so on, collected in the first two weeks of membership, to predict whether users would stay three months and how active they would be. They used recursive partitioning (using rpart) and logistic regression. They used lasso and lars because there are so many variables, attributes, and behavioral features. It turned out the most important things were whether users came more than once and whether they provided their sex (male or female). They conjectured that if a user is “receptive” to Facebook (e.g., willing to give personal information) they will be more likely to use it. If many people reach out to a user through Facebook, then that person is more likely to stay. If a user uses third-party features in addition to core features, they are more likely to stay, but if they mostly use third-party features (as opposed to connecting with other people using the core Facebook features) they are likely to leave.
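To make concrete why an L1 ("lasso") penalty helps when there are many candidate features, here is a small sketch of L1-penalized logistic regression. It is not Facebook's analysis and does not use R's rpart or lars: it swaps in a plain proximal gradient loop written in Python, and the data set is synthetic. The feature names `returned` and `noise_flag` and all counts are invented for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(value, threshold):
    """Shrink value toward zero; this step is what produces exact zeros."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(rows, labels, penalty=0.05, step=0.5, iterations=500):
    """L1-penalized logistic regression by proximal gradient descent."""
    n, k = len(rows), len(rows[0])
    weights, bias = [0.0] * k, 0.0
    for _ in range(iterations):
        # Prediction errors under the current weights.
        errors = [sigmoid(bias + sum(w * x for w, x in zip(weights, row))) - y
                  for row, y in zip(rows, labels)]
        bias -= step * sum(errors) / n  # the intercept is not penalized
        for j in range(k):
            gradient = sum(e * row[j] for e, row in zip(errors, rows)) / n
            weights[j] = soft_threshold(weights[j] - step * gradient,
                                        step * penalty)
    return weights, bias


# Synthetic users coded +1/-1: (returned, noise_flag, stayed, count).
# "stayed" depends on "returned" only; noise_flag is unrelated by construction.
cells = [
    (+1, +1, 1, 3), (+1, +1, 0, 1), (+1, -1, 1, 3), (+1, -1, 0, 1),
    (-1, +1, 1, 1), (-1, +1, 0, 3), (-1, -1, 1, 1), (-1, -1, 0, 3),
]
rows, labels = [], []
for returned, noise_flag, stayed, count in cells:
    rows += [[returned, noise_flag]] * count
    labels += [stayed] * count

weights, bias = fit_l1_logistic(rows, labels)
```

On this toy data the weight for `noise_flag` is shrunk to exactly zero while the weight for `returned` stays positive, which is the variable-selection behavior that makes the lasso attractive when the feature list is long.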
I was reminded of a social network entrepreneur I once met who flatly stated that the purpose of social networking sites is to help people “get laid.” He had deliberately designed his application to increase the probability that might happen. I wondered whether many of the people who drop out of Facebook do so because they’re looking for something like love or sex and decide they aren’t going to get it there. I asked whether there was a scientific process in place at Facebook where the “receptivity” conjecture could be tested against other hypotheses like this. Itamar said there is such a process, but I wonder how much like science it is and whether it could be improved. Itamar mentioned that there are still problems inferring direct causal relationships. I’m sure there are lots of reasons this is difficult, but I also believe that there are interesting things that can be done to improve in that area.
David Smith is the Community Director for Revolution Computing. The company produces high-performance parallel distributions of R that can take advantage of multiple cores and clusters, and it provides superior documentation. It also offers consulting and training.
I suspect that Revolution Computing’s clustering solution would not serve as a cloud computing version of R. However, Jim tells me that there is a cloud computing version that has been developed in France.
Jim Porzak started with R in 2003 because he was working for a non-profit at the time. Since then, he has done customer analytics and marketing analytics: prediction, response modeling, classification, clustering, and upsell. He mentioned that R’s requirement that all data be in memory limits the size of problem you can work on, but that you can often get around this by using a database. Jim uses MySQL.
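The database workaround Jim describes amounts to letting the database do the heavy aggregation and pulling only a small summary into the analysis session. A minimal sketch of the idea follows, assuming a hypothetical `orders` table with made-up rows; it uses Python's built-in SQLite module as a stand-in for a real server such as MySQL, and it is not Jim's actual setup.

```python
import sqlite3

# In-memory SQLite stands in for a real database server such as MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (1, 15.0), (2, 7.5), (3, 20.0), (3, 5.0), (3, 5.0)],
)

# Let the database aggregate; only one row per customer reaches the analysis.
summary = conn.execute(
    "SELECT customer_id, COUNT(*), SUM(amount) "
    "FROM orders GROUP BY customer_id ORDER BY customer_id"
).fetchall()
conn.close()
```

The raw table can be far larger than memory; what matters is that `summary`, the per-customer result, is small enough to model directly.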
ACTION ITEMS (mostly recommended by Jim):
- get John Chambers, Software for Data Analysis: Programming in R
- get tools: JGR (Java-based GUI for R), Rattle (data mining interface), Rcmdr (learning tool)
- look at CRAN task views for Machine Learning and Statistical Learning, and Multivariate Statistics
- find out more about cloud computing version of R