<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Paul O'Rorke</title>
	<atom:link href="http://ororke.com/paul/blog/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://ororke.com/paul/blog</link>
	<description>On Intelligence &#38; Software</description>
	<lastBuildDate>Wed, 27 Oct 2010 04:41:10 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Tom Fawcett, A Spam Assassin</title>
		<link>http://ororke.com/paul/blog/?p=758</link>
		<comments>http://ororke.com/paul/blog/?p=758#comments</comments>
		<pubDate>Tue, 26 Oct 2010 07:01:34 +0000</pubDate>
		<dc:creator>Paul O&#39;Rorke</dc:creator>
				<category><![CDATA[Meeting Notes]]></category>
		<category><![CDATA[Data Mining]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[SFBA ACM]]></category>

		<guid isPermaLink="false">http://ororke.com/paul/blog/?p=758</guid>
		<description><![CDATA[Tom Fawcett, Machine Learning Architect at Proofpoint, gave the San Francisco Bay Area ACM Data Mining SIG an insider&#8217;s view of email filtering on Monday, October 25th, 2010.  Proofpoint has thousands of customers large and small and guarantees in their service level agreement that customers will get no more than one spam message per 350,000 [...]]]></description>
			<content:encoded><![CDATA[<p>Tom Fawcett, Machine Learning Architect at Proofpoint, gave the San Francisco Bay Area ACM Data Mining SIG an insider&#8217;s view of email filtering on Monday, October 25th, 2010.  Proofpoint has thousands of customers large and small and guarantees in their service level agreement that customers will get no more than one spam message per 350,000 emails.  Tom pointed out that research on spam filtering has little to do with what companies do in practice in the &#8220;real world&#8221; and then he revealed a lot about how commercial spam filtering works.</p>
<p><span id="more-758"></span>Everyone applies induction looking for spam in &#8220;bags of words&#8221; but researchers have favored Support Vector Machines, random forests, and ensemble methods.  Everyone then does cross validation and accuracies of 99% are common.</p>
<p>Industrial applications differ from academic research in part because spam is one of those areas where there is a continuing arms race as both spammers and filters become ever more sophisticated. Tom showed a number of examples of spam illustrating the fact that text models alone are insufficient.</p>
<p>In industry, multiple stages are used with more intensive text processing reserved for later stages after earlier simpler stages have already identified half or more of the spam.  Proofpoint uses reputation models on the IP addresses of the sender and on domains and URLs, even taking into account how long the owner has held a domain and so on.  They use <a href="http://spamassassin.apache.org/">SpamAssassin</a>, an open source regular expression and rule based content matching system.  But that is just a small step in their overall process.  It provides some of the information used but Proofpoint&#8217;s lexicon and content models provide most of the leverage.</p>
<p>Interestingly, while most researchers use a simple representation (bag of words) with complex learning methods such as Support Vector Machines (SVM), Proofpoint uses a complex representation including context markers, AND/OR rules as well as phrases and words. They use a very simple classifier and learning method:  they use logistic regression to assign positive and negative weights to terms.  SVM, random forests, ensembles and so on don&#8217;t offer much improvement.  R^2 is about .95 so there is no need for better accuracy.  This is because Proofpoint&#8217;s representation is sophisticated and already incorporates attribute interactions, conditional dependencies and so on.  There are 1.3 million features going into the logistic regression.</p>
<p>A major advantage of having a simple classifier and using logistic regression is that it is comprehensible and transparent.  A &#8220;white box&#8221; is better for this application than opaque &#8220;black box&#8221; approaches. When something is classified as spam, an analyst can run it through the classifier and show the terms that influenced the classification, their weights, and so on.</p>
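<p>The &#8220;white box&#8221; idea can be illustrated with a minimal sketch.  It assumes a tiny hand-made table of term weights and an intercept, both invented for illustration and not taken from Proofpoint, and it uses naive substring matching in place of a real lexicon.  The point is only that a logistic model lets an analyst list exactly which terms pushed a message toward or away from spam.</p>

```python
import math

# Hypothetical term weights: positive values push toward spam,
# negative values push toward legitimate mail. Invented for illustration.
WEIGHTS = {
    "viagra": 2.5,
    "free money": 1.8,
    "unsubscribe": 0.4,
    "meeting": -1.2,
    "attached": -0.6,
}
BIAS = -1.0  # assumed intercept


def explain(text):
    """Return (spam probability, matched terms sorted by absolute weight)."""
    lowered = text.lower()
    # Naive substring matching stands in for a real term extractor.
    hits = [(term, w) for term, w in WEIGHTS.items() if term in lowered]
    score = BIAS + sum(w for _, w in hits)
    probability = 1.0 / (1.0 + math.exp(-score))  # logistic function
    hits.sort(key=lambda pair: abs(pair[1]), reverse=True)
    return probability, hits


prob, terms = explain("FREE MONEY and cheap Viagra, click to unsubscribe")
print(round(prob, 3))
for term, weight in terms:
    print(term, weight)
```

<p>Because the score is just a sum of weights, the printed list is the full explanation of the decision, which is what makes this kind of model easy for an analyst to audit.</p>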
<p>Proofpoint reparses their corpus every 24 hours and updates the classifier in about 1.5 hours.  They age things out of their corpus so that it can change over time as spam changes over time.  In order to be able to respond quickly to spam attacks, they also have a special &#8220;fast attack learning and response&#8221; system that adds new terms to the model every fifteen minutes.  Usually the representation is wrong rather than the model, so it is not necessary to rebuild the model from scratch and it suffices to improve the representation by adding terms.</p>
<p>Proofpoint does well in competitions against their competitors in part because they do text analysis whereas many of their competitors only do IP and domain filtering.  In an ongoing effort to stay ahead of spammers, link mining and network analysis are now being looked at as potential sources of information that can support spam identification.</p>
]]></content:encoded>
			<wfw:commentRss>http://ororke.com/paul/blog/?feed=rss2&amp;p=758</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Flurry&#8217;s Mobile App Analytics</title>
		<link>http://ororke.com/paul/blog/?p=743</link>
		<comments>http://ororke.com/paul/blog/?p=743#comments</comments>
		<pubDate>Thu, 21 Oct 2010 09:45:15 +0000</pubDate>
		<dc:creator>Paul O&#39;Rorke</dc:creator>
				<category><![CDATA[Meeting Notes]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[BI]]></category>
		<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[SDForum BI SIG]]></category>

		<guid isPermaLink="false">http://ororke.com/paul/blog/?p=743</guid>
		<description><![CDATA[Peter Farago and Sean Byrnes gave a juicy and surprising presentation about Flurry&#8217;s mobile app analytics at the SDForum Business Intelligence Special Interest Group meeting on 10/19/2010 in Palo Alto.  The title of their presentation was:  &#8220;Your Company’s Mobile App Blind Spot&#8221; and it provided both business and technical insights.
Flurry made a big splash in the [...]]]></description>
			<content:encoded><![CDATA[<p>Peter Farago and Sean Byrnes gave a juicy and surprising presentation about <a href="http://www.flurry.com/">Flurry</a>&#8217;s mobile app analytics at the <a href="http://www.sdforum.org/index.cfm?fuseaction=Calendar.eventDetail&amp;eventId=13765&amp;pageId=620">SDForum Business Intelligence Special Interest Group meeting on 10/19/2010</a> in Palo Alto.  The title of their presentation was:  &#8220;Your Company’s Mobile App Blind Spot&#8221; and it provided both business and technical insights.</p>
<p>Flurry made a big splash in the news when Steve Jobs got pissed off at them and called them out by name in an interview because they outed Apple&#8217;s iPad when it was still a closely guarded secret. (See a short video outtake of the interview at <a href="http://venturebeat.com/2010/06/02/apple-flurry-ipad/">VentureBeat</a>.)  Apple responded by changing legal agreements to exclude some third party analytics and some advertising.</p>
<p><span id="more-743"></span><a href="http://www.appleinsider.com/articles/10/06/02/flurry_modifies_data_collection_after_being_called_out_by_steve_jobs.html">Flurry changed how they collect device data</a>.  The Department of Justice and the Federal Trade Commission began to investigate Apple around this time due to a collection of legal restrictions and potentially anti-competitive business practices which may have come to include the analytics proscription.  Apple later relaxed the analytics ban.</p>
<p>Flurry is inside about a fifth of all apps and they cover 150 million devices:  about 90% of all Android devices and 80% of all iPhones.</p>
<p>Flurry&#8217;s analytics works as follows:</p>
<ol>
<li>Developers download Flurry&#8217;s SDK and wire it into their app (~ 5 minutes).</li>
<li>The instrumented app sends data via the internet to Flurry&#8217;s servers.</li>
<li>The data is processed using Hadoop on HBase and  stored in &#8220;giant cubes.&#8221;</li>
<li>Clients can then get metrics from a data warehouse through a web service at Flurry.com.</li>
</ol>
<p>Flurry provides a free service and then makes money with a product called <a href="http://www.flurry.com/product/appcircle/index.html">AppCircle</a>.  AppCircle combines an app recommender with rewards that use virtual currency.  AppCircle recommends another app to someone who already has a different app just as Amazon and others recommend a new purchase given that you have already bought or you are about to buy a given item.  Flurry makes money by splitting sales commissions with publishers roughly evenly if the app is installed after they recommend it.  One of the advantages they have is that they know which apps you have from the same publisher so they  will not recommend something you already have.</p>
<p>Flurry&#8217;s analytics service has accumulated 30TB of data since the company was launched.  They compress data on the client prior to sending it to their servers.  The core IP for this was developed earlier when the founders had to do compression to reduce data sizes on earlier handsets that were much less powerful than today&#8217;s handsets.  Flurry has one of the largest HBase databases.  Six servers receive reports from devices and store and forward them so they can be processed into their HBase db.</p>
<p>Perhaps the most interesting observation that was made with Flurry analytics is that the level of &#8220;engagement&#8221; (defined as frequency of use &#215; length of session) is far higher for iPhones than it is for other handsets.  iPhone apps are used far more often and for much longer periods of time (for example, people with iPhones use them to read far more than others do).</p>
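<p>As a rough illustration of the engagement definition above, the following sketch multiplies frequency of use by session length for two handsets.  The handset names and all of the numbers are made up for the example and are not Flurry data.</p>

```python
def engagement(sessions_per_week, avg_session_minutes):
    """Engagement as defined in the talk: frequency of use times session length."""
    return sessions_per_week * avg_session_minutes


# Made-up (sessions per week, average minutes per session) figures.
usage = {
    "handset_a": (20, 6.0),
    "handset_b": (8, 3.5),
}

scores = {name: engagement(freq, minutes) for name, (freq, minutes) in usage.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranked)
```

<p>With these units the product is simply total minutes of use per week, so a handset can rank higher either by being used more often or by holding attention longer per session.</p>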
<p>Perhaps the most important observation Flurry made is that there are 5k Android app developers and 100k apps in Android market versus 35k AppStore app developers and 300k apps in the AppStore. Apple is making money for developers in part by rewarding them with previously unheard of revenue sharing (70%).  For the Android marketplace to become viable, catch up, and surpass Apple it must attract app developers by making money for them.  Currently, the Android market is broken and it must be fixed.  Credit card scams occur in the Android store and they only go away when people vote them out (the customers are the only police) and there are a lot of other problems with the store that make it harder for app publishers to make money there.  One issue is that Android marketplace does not have a large paying customer base.  Apple has 150 million consumer credit cards (only Amazon and eBay have more).  Android is trying to work on this by working a deal with PayPal.</p>
<p>Currently, app developers and publishers &#8220;monetize&#8221; (make money) in several ways:</p>
<ul>
<li>customers pay up front;</li>
<li>apps are free but they get you later:
<ul>
<li>advertising</li>
<li>micro payments</li>
<li>virtual currency</li>
</ul>
</li>
<li>subscriptions are not an option, since app stores do not support them</li>
</ul>
<p>Carriers&#8217; stores failed in part because of the ways they tried to make money.  For example, typically nothing was free.</p>
<p>Flurry observed that Apple&#8217;s AppStore reached the adoption levels achieved by the iTunes Store in one third of the time. Also, currently, the percentage of game apps is much higher in the AppStore than in the Android market.  It may be that this is partly due to the fact that Apple regulates categories in the AppStore while Android allows developers or publishers to self-report the categories of their apps.  This is a data quality issue for analytics.</p>
<p>Apple provides basic analytics, for example a top downloads list and monthly store statistics.  Android app developers can only see how many downloads occur if their apps have Flurry inside.</p>
<p>Flurry pointed out that it is important to have users log in with a userid and password so that you can tie their activity together (rather than trying to use the handset&#8217;s Unique Device Identifier).  They also noted that &#8220;phones lie&#8221; about the current time:  10% had &#8220;bad clocks&#8221; reporting dates in the 1970s or in 2200 AD.  Flurry also deals with duplicate data coming in and de-duplicates it.  And they use some form of machine learning to identify devices, carriers, and countries.</p>
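<p>The talk did not say how Flurry handles bad clocks and duplicates, so the following is only a sketch of one plausible approach.  The report fields (<code>device</code>, <code>event_id</code>, <code>device_time</code>), the cutoff date, and the allowed clock skew are all assumptions made for illustration.</p>

```python
from datetime import datetime, timedelta, timezone

# Assumed sanity window: an arbitrary earliest date chosen for illustration,
# and at most one day ahead of the server clock.
EARLIEST = datetime(2008, 1, 1, tzinfo=timezone.utc)
MAX_SKEW = timedelta(days=1)


def clean(reports, received_at):
    """Drop duplicate reports and repair implausible device timestamps."""
    seen = set()
    result = []
    for report in reports:
        key = (report["device"], report["event_id"])  # hypothetical identity key
        if key in seen:
            continue  # duplicate delivery of the same event
        seen.add(key)
        stamp = report["device_time"]
        plausible = stamp >= EARLIEST and received_at + MAX_SKEW >= stamp
        fixed = dict(report)
        if not plausible:
            fixed["device_time"] = received_at  # fall back to server time
            fixed["clock_corrected"] = True
        result.append(fixed)
    return result


now = datetime(2010, 10, 19, 12, 0, tzinfo=timezone.utc)
reports = [
    {"device": "d1", "event_id": 1, "device_time": datetime(1970, 1, 1, tzinfo=timezone.utc)},
    {"device": "d1", "event_id": 1, "device_time": datetime(1970, 1, 1, tzinfo=timezone.utc)},
    {"device": "d2", "event_id": 7, "device_time": datetime(2010, 10, 19, 11, 59, tzinfo=timezone.utc)},
    {"device": "d3", "event_id": 2, "device_time": datetime(2200, 1, 1, tzinfo=timezone.utc)},
]
cleaned = clean(reports, now)
print(len(cleaned))
```

<p>Falling back to the server receive time keeps the event rather than discarding it, and the <code>clock_corrected</code> flag records that the device clock was not trusted.</p>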
<p>In addition to analytics and recommendation, Flurry does clustering (especially for Games).</p>
<p>Some interesting opinions about Google surfaced.  The Nexus One was a commercial flop because you had to pay $500 to buy it online and then you still had to sign up with a carrier (the carrier didn&#8217;t &#8220;come with it&#8221;).  Further, Google doesn&#8217;t know how to &#8220;cuddle&#8221; a consumer.  They are engineering-oriented to a fault.  Andy Rubin&#8217;s response to Steve Jobs&#8217;s comments about Android (about it not being &#8220;open&#8221;) was to tweet Unix code for downloading Android, a perfect example.</p>
<p>To show how Flurry can be used, examples were given of ad network performance and tracking cross selling.  In the case of cross selling, multi-dimensional scaling was used to highlight differences and similarities between items (apps in a portfolio).</p>
<p>A member of the audience asked whether Jobs can or might do the same things Flurry is doing.  The response was that any platform provider can build something up and cut out earlier third parties.  So there is some risk of that happening.</p>
]]></content:encoded>
			<wfw:commentRss>http://ororke.com/paul/blog/?feed=rss2&amp;p=743</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Salesforce&#8217;s Realtime Analytics</title>
		<link>http://ororke.com/paul/blog/?p=722</link>
		<comments>http://ororke.com/paul/blog/?p=722#comments</comments>
		<pubDate>Wed, 19 May 2010 00:37:56 +0000</pubDate>
		<dc:creator>Paul O&#39;Rorke</dc:creator>
				<category><![CDATA[Meeting Notes]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Cloud Computing]]></category>
		<category><![CDATA[SDForum BI SIG]]></category>

		<guid isPermaLink="false">http://ororke.com/paul/blog/?p=722</guid>
		<description><![CDATA[Salesforce&#8217;s CRM analytics architect, Donovan Schneider, presented an overview at the SDForum BI SIG meeting on May 18th, 2010.  Salesforce&#8217;s view of analytics is that it should deliver insight that is accessible to mere mortals, real-time, and trustworthy.
Salesforce&#8217;s analytics strategy is to start simple and grow from there.  Donovan counted dashboards and reports as analytics [...]]]></description>
			<content:encoded><![CDATA[<p>Salesforce&#8217;s CRM analytics architect, Donovan Schneider, presented an overview at the SDForum BI SIG meeting on May 18th, 2010.  Salesforce&#8217;s view of analytics is that it should deliver insight that is accessible to mere mortals, real-time, and trustworthy.</p>
<p><span id="more-722"></span>Salesforce&#8217;s analytics strategy is to start simple and grow from there.  Donovan counted dashboards and reports as analytics (in contrast to others who define analytics as what you have left when you subtract ETL and reporting from BI). He noted that Salesforce users have built three quarters of a million dashboards and dashboards are viewed 700,000 times per day.  There are over 12 million reports and 2.5 million reports are run each day.  In addition to dashboards and reports, Donovan included list views and search in the collection of analytics tools currently available.  Future additions to analytics will build from this collection instead of adding relatively high end analytics.</p>
<p>Donovan showed a nice architecture slide contrasting the traditional datawarehouse based approach to the new cloud-based approach taken by Salesforce and claimed that the datawarehouse approach is out of date, too complicated, and too rigid.  Salesforce&#8217;s new approach takes advantage of their cloud-based force.com API, multitenant architecture and platform so it is easy to use, realtime, and flexible.</p>
<p>Donovan gave a brief review of Salesforce&#8217;s multi-tenant architecture. Craig Weissman, Salesforce&#8217;s CTO, gave a presentation on the multi-tenant architecture for SDForum&#8217;s Software Architecture SIG back when I helped organize that group. Click <a href="http://ororke.com/paul/blog/?p=174">here</a> for more on this topic.</p>
<p>Salesforce achieves realtime analytics and simplification by not having a datawarehouse.  But unlike many companies using NoSQL data stores Salesforce does use a traditional RDBMS: Oracle. Currently, there are around fifteen databases for production each containing around ten terabytes. Salesforce relies on ACID transactions because they are supporting companies doing business on their platform.  When transactions are committed, the changes are available immediately everywhere and this supports the goal of realtime analytics.</p>
<p>Interestingly, Salesforce does not cache results of queries and so on because things change so frequently.  Salesforce is able to get away without caching and does without a datawarehouse in part by making query execution efficient.  Donovan went into some depth on how queries are optimized.  Salesforce has its own optimizer and does its own indexing in part because the multi-tenant architecture of their data often makes it impossible to directly use Oracle&#8217;s indexing and optimization.</p>
<p>In closing, Donovan said his top priorities now are adding analytical capabilities, improving scalability and usability, and supporting collaboration.  You can try out analytics for free as it is available in Salesforce&#8217;s sandbox along with a new report builder.  Donovan&#8217;s slides are available <a title="here" href="http://ororke.com/paul/blog/wp-content/uploads/2010/05/100518SalesforceAnalytics.pptx">here</a> and at <a href="http://sdforum.org/index.cfm?fuseaction=Page.viewPage&amp;pageId=620&amp;parentID=483&amp;nodeID=1">SDForum.org</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://ororke.com/paul/blog/?feed=rss2&amp;p=722</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Who (The Book)</title>
		<link>http://ororke.com/paul/blog/?p=348</link>
		<comments>http://ororke.com/paul/blog/?p=348#comments</comments>
		<pubDate>Thu, 29 Apr 2010 23:06:32 +0000</pubDate>
		<dc:creator>Paul O&#39;Rorke</dc:creator>
				<category><![CDATA[Reviews]]></category>

		<guid isPermaLink="false">http://ororke.com/paul/blog/?p=348</guid>
		<description><![CDATA[The book &#8220;Who:  The A Method for Hiring&#8221; describes how companies can stop using voodoo hiring methods and start using more rational and systematic methods.  It is based on thousands of interviews of CEOs and other hiring managers and is based on the idea that you should hire specialists, not generalists:  the [...]]]></description>
			<content:encoded><![CDATA[<p>The book <a href="http://whothebook.com/">&#8220;Who:  The A Method for Hiring&#8221;</a> describes how companies can stop using voodoo hiring methods and start using more rational and systematic methods.  It draws on thousands of interviews of CEOs and other hiring managers and rests on the idea that you should hire specialists, not generalists:  the right person for the right spot at the right time.</p>
<p>When hiring, consider <em>what do you want the person you hire to accomplish?</em> Create a scorecard composed of mission, outcomes, and competencies:</p>
<ul>
<li>mission:  the job&#8217;s core purpose;  why the role exists (in 1-5 sentences)</li>
<li>outcomes:  what must get done for A performance (not what the person will be doing but rather 3-8 specific objectives)</li>
<li><a href="http://ororke.com/paul/blog/?page_id=357">competencies</a>:  arbitrarily many role based competencies that describe behaviors that must be demonstrated to achieve outcomes plus 5-8 that describe the company culture</li>
</ul>
<p>The scorecard ties the two legs of the &#8220;A Method&#8221; for hiring, &#8220;Source&#8221; and &#8220;Select&#8221; together.<span id="more-348"></span></p>
<p>Four layers of interviews are used to select people:  the (phone) <em>screening</em> interview, the <em>top-grading</em>, <em>focused</em> and <em>reference</em> interviews.</p>
<p>The screening interview has four main questions:</p>
<ul>
<li>What are your career goals?</li>
<li>What are you really good at professionally?  (8-12 positives with examples)</li>
<li>What are you <strong>not</strong> good at or not interested in doing professionally?</li>
<li>Who were your last five bosses and how will they each rate your performance on a 1-10 scale <strong>when</strong> we  talk with them?</li>
</ul>
<p>The top-grading interview covers the last 15 years from the first company to the most recent one:</p>
<ul>
<li>What were you hired to do?</li>
<li>What accomplishments are you most proud of?</li>
<li>What were some low points during that job?</li>
<li>Who were the people you worked with?
<ul>
<li>What was your boss&#8217;s name?  What was it like working with him/her?  What will he/she tell me were your biggest strengths and areas for improvement?</li>
<li>How would you rate the team you inherited on an A, B, C scale?  What changes did you make?  Did you hire anybody?  Did you fire anybody?  How would you rate the team on an A, B, C scale when you left?</li>
</ul>
</li>
<li>Why did you leave that job?</li>
</ul>
<p>Check whether the accomplishments they were most proud of connected to the desired results, the mission and outcomes they were hired to do.  Drill down into the low points, especially if they can&#8217;t think of any or claim there weren&#8217;t any:  What went wrong?  What was their biggest mistake?  What part of the job did they not like?  In what ways were their peers stronger?  Asking for the names of people they worked with establishes the TORC (Threat of Reference Check) and improves veracity of responses.  Watch for whether they are disrespectful of their previous bosses.  Check for &#8220;push versus pull&#8221;:  do not hire candidates pushed out of jobs more than 20% of the time.</p>
<p>Allow time for discussion of aspirations and career goals and for questions.  To keep things on course, interrupt without shutting people up and use the &#8220;three P&#8217;s&#8221; asking for comparisons with previous, plans, and peers.</p>
<p>A focused interview focuses on the following statement and questions:</p>
<ul>
<li>&#8220;The purpose of this interview is to talk about ___ (outcome or competency).&#8221;</li>
<li>What are your biggest accomplishments in this area in your career?</li>
<li>What are your insights into your biggest mistakes and lessons learned in this area?</li>
</ul>
<p>A reference interview checks references:</p>
<ul>
<li>In what context did you work with this person?</li>
<li>What were the person&#8217;s biggest strengths?</li>
<li>What were the person&#8217;s biggest areas for improvement <strong>back then</strong>?</li>
<li>How would you rate the performance in that job on a 1-10 scale?  What about his/her performance causes you to give that rating?</li>
<li>The person mentioned that he/she struggled with ___ on that job.  Can you tell me more about that?</li>
</ul>
<p>After collecting information in interviews, one can then decide and act.  The &#8220;Who&#8221; method looks for alignment or a good match (visualized in terms of concentric circles as in an archery target) between the candidate&#8217;s skillset and will and the scorecard.  In this stage, avoid red flags like behavior warning signs (Marshall Goldsmith) including avoiding responsibility, blaming others, making excuses and making negative comments about previous colleagues.</p>
<p>A summary of the overall process for selecting &#8220;A&#8221; players is:</p>
<ol>
<li>screening interview</li>
<li>top-grading interview</li>
<li>focused interview(s)</li>
<li>candidate discussion</li>
<li>reference interview(s)</li>
<li>final decision (skill-will analysis = bull&#8217;s-eye?)</li>
</ol>
<p>Once a candidate is selected, you must &#8220;sell&#8221; them on joining by addressing fit (between the company&#8217;s vision, needs, and culture and the candidate&#8217;s goals, strengths, and values), family, freedom, fortune, and fun.</p>
<p>The process described by this book and summarized here seems far better than the relatively irrational and random processes used by most companies to hire new employees.  However, the book frequently makes assumptions that are open to question.  For example, the book assumes that it is best to hire people who fit specific job requirements that a company has at a given time.  This assumption is a basis for the use of scorecards.  Google currently employs an apparently successful hiring strategy that contradicts the specialist approach.  Google&#8217;s strategy is to hire &#8220;smart generalists&#8221; who can work on a variety of different projects over time instead of hiring specialists.</p>
<p>Assumptions like these reflect core values of companies and their management and not just hypotheses about what will work best.  For example, companies that minimize costs and headcount and that assume they can fire workers as soon as their skills are no longer needed and hire new workers with newly relevant skills have less interest in hiring generalists.  But a company that wants to hire people and to avoid firing them for ethical or humanitarian reasons, just because corporate goals, projects and technologies change, may want to hire more broadly skilled individuals.</p>
<p>I would like to see analytics and a more scientific approach used to evaluate alternative hiring methods and their underlying assumptions.    My guess is that different strategies work for different kinds of jobs in different companies and in different industries.</p>
]]></content:encoded>
			<wfw:commentRss>http://ororke.com/paul/blog/?feed=rss2&amp;p=348</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Combining Performance and Decision Management</title>
		<link>http://ororke.com/paul/blog/?p=326</link>
		<comments>http://ororke.com/paul/blog/?p=326#comments</comments>
		<pubDate>Tue, 20 Apr 2010 18:59:10 +0000</pubDate>
		<dc:creator>Paul O&#39;Rorke</dc:creator>
				<category><![CDATA[Meeting Notes]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[BI]]></category>
		<category><![CDATA[SDForum BI SIG]]></category>

		<guid isPermaLink="false">http://ororke.com/paul/blog/?p=326</guid>
		<description><![CDATA[James Taylor, CEO of Decision Management Solutions, gave a talk on &#8220;Performance Management and Agility&#8221; at the monthly meeting of the SDForum BI SIG on Tuesday, April 20th.  He argued that traditional BI and performance management result in dashboards that measure and monitor like instrument clusters in cars.  But what is needed is [...]]]></description>
			<content:encoded><![CDATA[<p>James Taylor, CEO of <a href="http://decisionmanagementsolutions.com/">Decision Management Solutions</a>, gave a talk on &#8220;Performance Management and Agility&#8221; at the monthly meeting of the SDForum BI SIG on Tuesday, April 20th.  He argued that traditional BI and performance management result in dashboards that measure and monitor like instrument clusters in cars.  But what is needed is something more like the cockpits in airplanes:  there should be buttons and levers and so on that enable the &#8220;pilot&#8221; to act on the information presented by the dashboard.  James argued for combining performance management with decision management (a term he pioneered) so that information supports decision-making that leads to action.<br />
<span id="more-326"></span><br />
The analogy with airplanes, pilots, and cockpits provided several additional useful examples since planes have autopilots and pilots learn on simulators.  James claimed that decision management can help automate some decisions (providing an autopilot) and can allow for simulation supporting experimentation.</p>
<p>Agility was a key theme of the presentation and part of James&#8217; approach is intended to support quick actions.  When an event occurs, in some cases it is important to notice quickly and act almost immediately as the value of action based on the information about the event decreases rapidly as time passes.  With other events, you have more time to consider options.  But the key is to notice and act appropriately in a timely manner.  It isn&#8217;t enough just to make information available or notice things quickly or even in real time.  Sometimes noticing immediately isn&#8217;t even necessary or useful (e.g., when the required action would take a long time anyway).</p>
<p>James uses a three stage approach to improving operational decisions:</p>
<ol>
<li>discover the important decisions</li>
<li>build (automate) decision services</li>
<li>analyze decisions and create a closed loop between analytics and decision services.</li>
</ol>
<p>The slides for James&#8217;s presentation are available at the BI SIG&#8217;s web page at <a href="http://SDForum.org/BISIG">http://SDForum.org/BISIG</a> and <a href='http://ororke.com/paul/blog/wp-content/uploads/2010/04/100420Performance-Management-and-Agilty-SDForum.pptx'>here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://ororke.com/paul/blog/?feed=rss2&amp;p=326</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>The Analytics Revolution</title>
		<link>http://ororke.com/paul/blog/?p=293</link>
		<comments>http://ororke.com/paul/blog/?p=293#comments</comments>
		<pubDate>Sat, 10 Apr 2010 06:59:00 +0000</pubDate>
		<dc:creator>Paul O&#39;Rorke</dc:creator>
				<category><![CDATA[Meeting Notes]]></category>

		<guid isPermaLink="false">http://ororke.com/paul/blog/?p=293</guid>
		<description><![CDATA[The first SDForum conference on analytics, &#8220;The Analytics Revolution,&#8221; was held in Mountain View on Friday, April 9th, 2010.  The conference focused on recent advances in analytics, new opportunities afforded by these advances, and ways companies can take advantage of the analytics revolution in progress.
Over 250 people attended. Some leading analytics gurus and data [...]]]></description>
			<content:encoded><![CDATA[<p>The first SDForum conference on analytics, &#8220;The Analytics Revolution,&#8221; was held in Mountain View on Friday, April 9th, 2010.  The conference focused on recent advances in analytics, new opportunities afforded by these advances, and ways companies can take advantage of the analytics revolution in progress.</p>
<p><span id="more-293"></span>Over 250 people attended. Some leading analytics gurus and data scientists gave presentations and participated in panel discussions.  This blog post links to summaries of the keynotes and panels.</p>
<p>There were four panel discussions at the conference:</p>
<ul>
<li><a href="http://ororke.com/paul/blog/?p=483">Competing on Analytics</a></li>
<li><a href="http://ororke.com/paul/blog/?p=489">Analyzing Big Data</a></li>
<li><a href="http://ororke.com/paul/blog/?p=573">New Frontiers</a></li>
<li><a href="http://ororke.com/paul/blog/?p=564">The Investor Perspective</a></li>
</ul>
<p>Keynote speakers representing an array of large companies with strong analytics capabilities gave presentations on a wide range of analytics efforts:</p>
<ul>
<li>Ronny Kohavi (Microsoft) &#8220;<a href="http://ororke.com/paul/blog/?p=530">Online Controlled Experiments:  Listening to the Customers, Not to the HiPPO</a>&#8221;</li>
<li>Sanjay Poonen (SAP) &#8220;<a href="http://ororke.com/paul/blog/?p=506">Leading the Analytics Revolution</a>&#8221;</li>
<li>Peter Norvig (Google) &#8220;<a href="http://ororke.com/paul/blog/?p=295">The Unreasonable Effectiveness of Data</a>&#8221;</li>
<li>Jeff Kreulen (IBM) &#8220;<a href="http://ororke.com/paul/blog/?p=556">Analytics:  An Applied Researcher&#8217;s Perspective</a>&#8221;</li>
<li>Jaap Suermondt (HP) &#8220;<a href="http://ororke.com/paul/blog/?p=560">Research in Analytics for Operational Impact at HP</a>&#8221;</li>
</ul>
<p>You can click on the links above to get quick summaries.  The keynote presentations and panel discussions were recorded, and the audio and video recordings are available individually from the pages above, at <a href="http://www.dyyno.com/channel/sdforum">dyyno.com</a>, or at SDForum <a href="http://sdforum.org/index.cfm?fuseaction=Page.ViewPage&amp;PageID=997">here</a>.  The conference was sponsored by Accel Partners, Dyyno, IBM, Impetus, Microsoft, and SAP.  The organizing committee included Stacey Bishop (Scale Venture Partners), Jim Claussen (IBM), Lars Leckie (Hummer Winblad), Sonja London (Summit Software), Paul O&#8217;Rorke (Kraken Data), and Julia Queck (SAP).</p>
]]></content:encoded>
			<wfw:commentRss>http://ororke.com/paul/blog/?feed=rss2&amp;p=293</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Analytics:  Competing on Analytics at the Highest Level</title>
		<link>http://ororke.com/paul/blog/?p=483</link>
		<comments>http://ororke.com/paul/blog/?p=483#comments</comments>
		<pubDate>Sat, 10 Apr 2010 05:37:59 +0000</pubDate>
		<dc:creator>Paul O&#39;Rorke</dc:creator>
				<category><![CDATA[Meeting Notes]]></category>

		<guid isPermaLink="false">http://ororke.com/paul/blog/?p=483</guid>
		<description><![CDATA[The &#8220;Competing on Analytics&#8221; panel at the SDForum Conference on &#8220;The Analytics Revolution&#8221; included people from companies using analytics to &#8220;compete at the highest level&#8221; according to the five-stage maturity model in the book &#8220;Competing on Analytics: The New Science of Winning.&#8221;  The panelists (Ken Rudin, Zynga; Kevin Weil, Twitter; DJ Patil, LinkedIn; Neel Sundaresan [...]]]></description>
			<content:encoded><![CDATA[<p>The &#8220;Competing on Analytics&#8221; panel at the SDForum Conference on &#8220;The Analytics Revolution&#8221; included people from companies using analytics to &#8220;compete at the highest level&#8221; according to the five-stage maturity model in the book &#8220;<a href="http://www.amazon.com/Competing-Analytics-New-Science-Winning/dp/1422103323">Competing on Analytics: The New Science of Winning</a>.&#8221;  The panelists (Ken Rudin, Zynga; Kevin Weil, Twitter; DJ Patil, LinkedIn; Neel Sundaresan, eBay) represented a good mix from the relatively new Twitter to the larger, older, more established eBay.  David Steier, PriceWaterhouseCoopers, moderated the panel.</p>
<p><span id="more-483"></span>David began by asking:  how do the panelists use analytics to get a competitive advantage?  What is an example of something they found that surprised them?</p>
<p>Ken Rudin and Zynga use analytics to achieve their goals of increasing revenue, improving user retention, and increasing viral spread for their online social games, including Farmville.  They collect 3-4 terabytes of data from their 40-50 million daily users.  Initially they were reactive:  producing reports in response to requests.  Now they use analytics as an integral part of game design:  A/B testing and experiments are an integral part of development, just as QA is.  Analysts are part of the design team.  An example of something that surprised them occurred when they analyzed Mafia game players.  There are two groups of players, the &#8220;crime jobbers&#8221; and the &#8220;fighters.&#8221;  They discovered that fighters spend twice as much:  they are trying to compete with their friends, so they purchase weapons to arm their mafia members and defeat their friends&#8217; mafia gangs.  Since discovering this, Zynga has changed the game to encourage players to fight more instead of just doing crime jobs.</p>
<p>Kevin Weil mentioned that a key accomplishment of the analytics team at Twitter has been to get everyone to consider data (and to make it possible for them to do so) in all important decisions.  He described an example of how analytics surprised them that involved the social network underlying Twitter.  Many people use Twitter largely as an information source and their network of &#8220;following&#8221; links says something about their interests.  But the bidirectional links can be ignored when trying to determine their interests.  The unidirectional links (e.g., the people they follow who don&#8217;t follow them) are the ones that carry the most important incoming information.</p>
<p>David Steier asked the panelists:  How do you organize people to achieve the company&#8217;s goals?  The panelists&#8217; companies all have analytics teams, a team responsible for the company&#8217;s analytics platform, and analysts who work with other teams.  But the way these groups work within the companies differs, and several companies are adopting innovative approaches.</p>
<p>DJ Patil and LinkedIn started out by looking at other companies such as Google and Yahoo.  Yahoo had analytics in a separate research organization and it was difficult or impossible to get it into products.  Google is driven by technology and bolts products on top (see also <a href="http://techcrunch.com/2010/05/15/facebook-google/">http://techcrunch.com/2010/05/15/facebook-google/</a>).  So to ensure that analytics is integrated into products at LinkedIn, the Analytics team is a substantial part (1/4th) of the product team and has its own designers and developers so it is easier to go straight to production or to integrate with existing products.</p>
<p>Neel Sundaresan (eBay) claimed that &#8220;everybody should be an analytics scientist.&#8221;  eBay has an analytics platform team that provides data to the rest of the company and &#8220;the data tells you what the product should be.&#8221;  With 200 million users and a billion searches per day, eBay gets tremendous amounts of data and product managers and developers and even some machine learning scientists need to learn to look at the data.</p>
<p>Ken Rudin (Zynga) argued that the whole idea of having an &#8220;analytics team&#8221; is flawed.  He asked:  &#8220;Does Microsoft have an internet division?&#8221;  Like the internet, analytics is or should be fundamental to everything in the company.  So his goal is to work himself out of a job by putting analysts in development and product teams and by training almost everyone in the company in analytics, starting with product managers, then engineers, and then quality assurance.</p>
<p>In summary, one of the key themes of the panel was that companies competing on analytics are finding innovative ways to integrate analytics throughout their companies.  One of the biggest problems companies face these days is that it is difficult to find good analytics people or data scientists.  Ken Rudin is looking for different kinds of people now as compared to five years ago.  Now he is looking for people with analytics and business abilities.  So for example, instead of just taking data that is given to you and looking for interesting patterns, you should be able to take a business goal like &#8220;increase player retention&#8221; and figure out how to do it, what data you need, and so on.</p>
<p>A recording of this panel is available at <a href="http://www.dyyno.com/channel/sdforum#vod=1969">dyyno.com</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://ororke.com/paul/blog/?feed=rss2&amp;p=483</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Analytics:  Analyzing &#8220;Big Data&#8221;</title>
		<link>http://ororke.com/paul/blog/?p=489</link>
		<comments>http://ororke.com/paul/blog/?p=489#comments</comments>
		<pubDate>Sat, 10 Apr 2010 04:44:43 +0000</pubDate>
		<dc:creator>Paul O&#39;Rorke</dc:creator>
				<category><![CDATA[Meeting Notes]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[SDForum]]></category>

		<guid isPermaLink="false">http://ororke.com/paul/blog/?p=489</guid>
		<description><![CDATA[The panel on &#8220;Analyzing Big Data&#8221; at the SDForum Analytics Conference on &#8220;The Analytics Revolution&#8221; included representatives of two companies that analyze data on a petabyte scale (Joydeep Sen Sarma, Facebook and Joshua Klahr, Yahoo!) and two software companies that stand behind open source infrastructure components that are often used to build analytics platforms (Amr [...]]]></description>
			<content:encoded><![CDATA[<p>The panel on &#8220;Analyzing Big Data&#8221; at the SDForum Analytics Conference on &#8220;<a href="http://ororke.com/paul/blog/?p=293">The Analytics Revolution</a>&#8221; included representatives of two companies that analyze data on a petabyte scale (Joydeep Sen Sarma, Facebook and Joshua Klahr, Yahoo!) and two software companies that stand behind open source infrastructure components that are often used to build analytics platforms (Amr Awadallah, Cloudera/Hadoop and James Phillips, Northscale/Memcached and Membase).  The moderator, Owen Thomas of VentureBeat, started off by asking the panelists whether &#8220;big data&#8221; is a Silicon Valley phenomenon that will soon spread to the Fortune 500 and the rest of the world.</p>
<p><span id="more-489"></span>Amr Awadallah defined &#8220;big data&#8221; as 10 terabytes or more and noted that when Cloudera talks to prospective customers they often have &#8220;medium data&#8221; (less than 10 terabytes). Amr noted that individual nodes can hold 4, 12, or soon even 24 TB.  So &#8220;medium data&#8221; problems don&#8217;t require a cluster or Hadoop at all just to deal with the size of the data.  The smallest Hadoop cluster requires three nodes. Most of Cloudera&#8217;s customers have Hadoop clusters with tens of nodes, up to a few hundred at the high end.  Facebook and Yahoo! have clusters with thousands of nodes.</p>
<p>Joshua Klahr said that Yahoo! collects tens of terabytes per day of ad data and weblogs telling them what ads are successful and what content is relevant.  Since introducing social applications and features, they have seen a dramatic increase in data because they are collecting text generated by users rather than just clicks.</p>
<p>Joydeep Sen Sarma said Facebook has 400 million users and collects over 12 terabytes of compressed data per day.  Facebook has over two petabytes of data.  Joydeep noted that two factors have driven the collection of larger amounts of data and the shift toward open source platforms and tools:  things can now be measured that could not be measured previously, and the value per byte of web companies&#8217; data is much lower than for earlier kinds of companies.</p>
<p>Relatively few companies have petabyte scale data sets.  But the issue is not so much the size of the data.  The real issue is complexity.  Things like Hadoop are important not just because they enable companies to work with &#8220;stupendous&#8221; data sets but also (and more importantly) because they enable companies to work with complex datasets including data that has not been organized into an RDBMS or schema, and including text and weblogs.</p>
<p>The consensus of the panel was that &#8220;big data&#8221; and associated technologies already are spreading and will continue to spread.  Silicon Valley companies may be early developers and early adopters of technology for analyzing &#8220;Big Data&#8221; but it is spreading from web companies to other sectors including banking, games, government, and telecommunications and from the West to the East Coast and overseas.</p>
<p>Owen invited the panelists to comment on the NoSQL movement. Perhaps surprisingly, several panelists identified with NoSQL came out against the term in one way or another.</p>
<p>Amr Awadallah (Cloudera) prefers &#8220;NOSQL&#8221; (Not Only SQL) instead of &#8220;NoSQL&#8221; since more than half of all analysts use SQL. NoSQL doesn&#8217;t make sense if it argues against having SQL.  The main issue for him is &#8220;agility&#8221;:  the ability to make changes quickly and to be flexible with regard to how things are done.  Amr breaks this down into two kinds of agility:  &#8220;agility of data types&#8221; and &#8220;agility of language.&#8221;</p>
<p>Amr explained his concept of &#8220;agility of data types&#8221;:  When traditional RDBMSs are used and rigid schemas are required, it is necessary to go through a DBA whenever a change in the schema is required (for example, to add a new column) and this can take a long time.  Ditto for loading new data into the schema:  ETL is required to load the data and this can take too long.  NoSQL approaches have the benefit that they allow operating without a schema or they allow for changing schemata easily and quickly.</p>
<p>Amr&#8217;s concept of &#8220;agility of language&#8221; is:  Taking a purely SQL-based approach with traditional RDBMSs is too inflexible.  Approaches based on Hadoop can go beyond SQL and allow the use of programming languages more powerful than SQL that accommodate the preferences of your developers (e.g., Java, Python, C, Perl).</p>
<p>James Phillips (Northscale) identified himself as a NoSQL advocate but said it is not about SQL:  it&#8217;s not about the query language.   The issues are really storage, scaling, and performance.  The ACID transaction guarantees provided by traditional RDBMSs come with performance and scalability costs and many applications don&#8217;t need the guarantees but rather need greater scalability and higher performance.</p>
<p>Joydeep said NoSQL is like a religion and he hates religion.  Although Hadoop is often considered to be part of NoSQL systems, his Hive project introduced a simplified version of SQL on top of Hadoop because many analysts prefer to work with SQL.  The real advance has not been to eliminate SQL but rather it is the breaking down or deconstruction of previously monolithic systems into separable components and layers. The components and layers include storage (e.g., the filesystem) and processing (e.g., indexing, MapReduce, query processing, and text processing). One can &#8220;rack and stack&#8221; and build systems out of the components according to one&#8217;s needs.</p>
<p>Joshua said that as a Product Manager, he uses Excel extensively.  And he said that Yahoo finds it easier to find SQL coders than MapReduce programmers.  The consensus was that Excel and SQL are here to stay.</p>
<p>Owen asked the panelists how we can avoid having a &#8220;data priesthood&#8221; and how we can promote the &#8220;democratization of data.&#8221;  Several panelists referred to the panel on &#8220;Competing on Analytics at the Highest Level&#8221; because the practices of the analytics competitors on that panel addressed this issue.  In addition, several other ways to make data usable across the company were mentioned, such as providing tools that make it easier for people with various backgrounds, knowledge, and skills to use data.  For example, Hadoop is written in Java, but SQL programmers and programmers more familiar with other languages like Python can also use it (e.g., using Streaming or Hive).  Going forward, we will see more connections from Hadoop, Hive, and Pig to existing BI tools like MicroStrategy (e.g., through ODBC connectors under development by Cloudera and Facebook) and this should further &#8220;democratize the data.&#8221;</p>
<p>A recording of the panel discussion is available at <a href="http://www.dyyno.com/channel/sdforum#vod=1972">dyyno.com</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://ororke.com/paul/blog/?feed=rss2&amp;p=489</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Analytics:  Online Controlled Experiments &#8211; Listening to the Customers, Not to the HiPPO</title>
		<link>http://ororke.com/paul/blog/?p=530</link>
		<comments>http://ororke.com/paul/blog/?p=530#comments</comments>
		<pubDate>Sat, 10 Apr 2010 04:14:17 +0000</pubDate>
		<dc:creator>Paul O&#39;Rorke</dc:creator>
				<category><![CDATA[Meeting Notes]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[SDForum]]></category>

		<guid isPermaLink="false">http://ororke.com/paul/blog/?p=530</guid>
		<description><![CDATA[Ronny Kohavi (Microsoft) started out by telling a famous true story about Greg Linden&#8217;s experience moving a recommender to the shopping cart at Amazon.  A Senior VP of Marketing vetoed Greg&#8217;s proposal fearing that it would distract customers from checking out and paying for the items already in their shopping basket reducing conversion.  This is [...]]]></description>
			<content:encoded><![CDATA[<p>Ronny Kohavi (Microsoft) started out by telling a <a href="http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html">famous true story</a> about Greg Linden&#8217;s experience moving a recommender to the shopping cart at Amazon.  A Senior VP of Marketing vetoed Greg&#8217;s proposal, fearing that it would distract customers from checking out and paying for the items already in their shopping basket, reducing conversion.  This is where the &#8220;HiPPO&#8221; in the title of Ronny&#8217;s presentation comes from.  It stands for the &#8220;Highest Paid Person&#8217;s Opinion&#8221; and sometimes for the person (e.g., the VP) holding the opinion.  The Amazon story had a happy ending because Jeff Bezos had established a corporate culture that allowed for experiments to be run, so Greg was able to run an experiment to test the hypothesis of the HiPPO.  It turned out that conversions did indeed drop, but the increased revenue due to customers purchasing recommended items was substantially greater than the loss.</p>
<p><span id="more-530"></span>The online controlled experiments advocated by Ronny, including A/B tests, are like the trials used to test drugs:  they get at the causes of observed effects since all the experimental subjects are exposed to the same non-causal factors. Ronny ran the audience through a series of examples showing how difficult it is to make correct decisions about a series of alternative web page designs.  Since people are bad at evaluating proposals, especially more novel innovations, it is important to test a lot of ideas, &#8220;fail fast,&#8221; and try again quickly instead of doing elaborate planning and preparation in advance.</p>
<p>In Ronny&#8217;s years of experience, he has observed that people and organizations go through stages:  they go from hubris to getting insights through measurement followed by the &#8220;Semmelweis Reflex&#8221; and then fundamental understanding.  Initially, people are sure they know it all but then they realize it helps to do experiments and take measurements to get data.  The <a href="http://en.wikipedia.org/wiki/Semmelweis_reflex">Semmelweis reflex</a> is the reflex-like rejection of new knowledge because it contradicts entrenched beliefs.  It is named after <a href="http://en.wikipedia.org/wiki/Ignaz_Semmelweis">Ignaz Semmelweis</a> who proposed that doctors clean their hands to reduce the spread of infections but who was rejected in spite of the fact that he had data to support his claim.</p>
<p>Ronny&#8217;s &#8220;take home&#8221; points were:</p>
<ul>
<li>data trumps intuition &#8211; it&#8217;s hard to assess the value of ideas so do experiments;</li>
<li>get your organization to agree on what to optimize and use data to drive decisions.</li>
</ul>
<p>A video recording of Ronny&#8217;s presentation is available at <a href="http://www.dyyno.com/channel/sdforum#vod=1968">dyyno.com</a>.  More technical information including information on how to conduct experiments is available at <a href="http://www.exp-platform.com">http://www.exp-platform.com</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://ororke.com/paul/blog/?feed=rss2&amp;p=530</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Analytics:  The Unreasonable Effectiveness of Data</title>
		<link>http://ororke.com/paul/blog/?p=295</link>
		<comments>http://ororke.com/paul/blog/?p=295#comments</comments>
		<pubDate>Sat, 10 Apr 2010 03:30:05 +0000</pubDate>
		<dc:creator>Paul O&#39;Rorke</dc:creator>
				<category><![CDATA[Meeting Notes]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[SDForum]]></category>

		<guid isPermaLink="false">http://ororke.com/paul/blog/?p=295</guid>
		<description><![CDATA[Peter Norvig focused on a major lesson learned at Google and elsewhere in recent years and gave a fascinating keynote presentation on &#8220;The Unreasonable Effectiveness of Data&#8221; at the SDForum conference on &#8220;The Analytics Revolution&#8221; April 9th, 2010.  The lesson is that data can be surprisingly effective:  it can be used to get [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://en.wikipedia.org/wiki/Peter_Norvig">Peter Norvig</a> focused on a major lesson learned at Google and elsewhere in recent years and gave a fascinating keynote presentation on &#8220;The Unreasonable Effectiveness of Data&#8221; at the SDForum conference on &#8220;<a href="http://ororke.com/paul/blog/?p=293">The Analytics Revolution</a>&#8221; April 9th, 2010.  The lesson is that data can be surprisingly effective:  it can be used to get better performance improvements than one can get from improvements in algorithms.<br />
<span id="more-295"></span><br />
In contrast to Wigner&#8217;s &#8220;<a href="http://www.physik.uni-wuerzburg.de/fileadmin/tp3/QM/wigner.pdf">The Unreasonable Effectiveness of Mathematics in the Natural Sciences</a>,&#8221; Norvig&#8217;s presentation pointed out that in biology, natural language, and other complex domains, often it does not pay to strive for elegant mathematical formulas or compact, simple models or theories.  And it never pays to waste time trying for perfect models because as George Box said &#8220;&#8230;all models are wrong, but some are useful.&#8221;  Relatively simple methods can often be used to take advantage of ample data to build useful models.  The models may be relatively complex, but sometimes the data seems to demand this, and even more laborious methods for constructing models &#8220;by hand&#8221; may produce results that are at least as complex and more brittle.  An example of a rule base for spelling correction taken from <a href="http://www.htdig.org/">HTDig</a> was shown and it seemed to be very complex.  Peter pointed out that it would be difficult to change that rule base to extend it to another language but it would be relatively easy in a more data-driven approach:  you would just need a lot of examples in the new language.  Peter remarked that data-driven programming is the ultimate agile method.</p>
<p>In many cases three steps need to be taken: choosing a representation language, encoding a model in that language, and then performing inference on the model.  Peter summarized his recommended approach with the acronym DINO:  Data In Non-parametric model Out.  Google&#8217;s Seti system for using machine learning to acquire models by learning from massive data sets is described by Simon Tong in the Google research blog at &#8220;<a href="http://googleresearch.blogspot.com/2010/04/lessons-learned-developing-practical.html">Lessons Learned Developing a Practical Large Scale Machine Learning System</a>.&#8221;</p>
<p><a href="http://www.hpl.hp.com/personal/Jaap_Suermondt/">Jaap Suermondt</a> gave a counterexample later in the day in his closing keynote.  In the example, unmanageable amounts of data needed to be processed to solve an optimization problem.  It was a linear programming problem but it turned out to be a special case that had a more efficient solution.  Even so, it turned out to be necessary to improve on that for the special problem at hand in order to get a solution in a reasonable time.  In this case, they had tons of data but it was just clutter until an improved algorithm was found that made it possible to get what was wanted out of the data.</p>
<p>Peter&#8217;s response to this counterexample was that Google also invests time in improving its algorithms.  They have a lot of nearest neighbor problems and need to avoid searching for nearest neighbors, so they invest effort in locality-sensitive hashing, resulting in a simple algorithm.  So they are not dogmatic.  Even so, the point is that it is surprisingly often the case that data is more important than programs.</p>
<p>In trying to capture the gist of Peter&#8217;s presentation, I have skipped over a lot of great examples and interesting points.  A complete video recording of Peter&#8217;s presentation provided by Dyyno is available at &#8220;<a href="http://www.dyyno.com/channel/sdforum#vod=1984">Analytics Conference &#8211; Keynote &#8211; Peter Norvig</a>.&#8221;  <a href="http://www.computer.org/portal/web/csdl/doi/10.1109/MIS.2009.36">&#8220;The Unreasonable Effectiveness of Data&#8221;</a> also appears as an &#8220;expert opinion&#8221; article published in IEEE Intelligent Systems by Alon Halevy, Peter Norvig, and Fernando Pereira, pp. 8-12, March/April, 2009.  Seeds of the notion that more data can be better or more important than trying for better algorithms on smaller datasets appeared in an earlier presentation Peter gave at PARC Forum in 2006 on &#8220;<a href="http://www.parc.com/event/499/web-search-as-a-product-of-and-catalyst-for-ai.html">Web Search as a Product of and Catalyst for AI</a>.&#8221;</p>
]]></content:encoded>
			<wfw:commentRss>http://ororke.com/paul/blog/?feed=rss2&amp;p=295</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

