Tom Fawcett, Machine Learning Architect at Proofpoint, gave the San Francisco Bay Area ACM Data Mining SIG an insider’s view of email filtering on Monday, October 25th, 2010. Proofpoint has thousands of customers large and small and guarantees in their service level agreement that customers will get no more than one spam message per 350,000 emails. Tom pointed out that research on spam filtering has little to do with what companies do in practice in the “real world” and then he revealed a lot about how commercial spam filtering works.
In research, everyone applies induction to "bag of words" representations of messages, though the favored learners are Support Vector Machines, random forests, and ensemble methods. Everyone then evaluates with cross-validation, and accuracies of 99% are common.
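The "bag of words" representation is simple enough to sketch in a few lines: each message is reduced to a count of its words, discarding order. The toy messages below are invented for illustration.

```python
# Minimal bag-of-words sketch: each message becomes a multiset of word
# counts, the representation most spam-filtering research starts from.
from collections import Counter

def bag_of_words(message):
    """Lowercase, split on whitespace, and count word occurrences."""
    return Counter(message.lower().split())

spam = bag_of_words("Cheap meds cheap meds buy now")
ham = bag_of_words("Meeting moved to Tuesday at noon")

print(spam["cheap"])   # 2
print(ham["meeting"])  # 1
```

In research settings these count vectors are then fed to the learner of choice and scored with cross-validation.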
Industrial applications differ from academic research in part because spam filtering is a continuing arms race in which both spammers and filters become ever more sophisticated. Tom showed a number of examples of spam illustrating that text models alone are insufficient.
In industry, filtering is done in multiple stages, with the more intensive text processing reserved for later stages, after earlier, simpler stages have already identified half or more of the spam. Proofpoint uses reputation models on sender IP addresses and on domains and URLs, even taking into account how long the owner has held a domain. They use SpamAssassin, an open-source content-matching system based on regular expressions and rules, but that is only a small step in their overall process: it provides some of the information used, while Proofpoint's own lexicon and content models provide most of the leverage.
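The staged design might be sketched as below. The reputation table, rules, thresholds, and IPs are all invented for illustration; this is a sketch of the general pattern, not Proofpoint's actual pipeline.

```python
import re

# Hypothetical reputation scores for sending IPs (0 = bad, 1 = good).
IP_REPUTATION = {"203.0.113.9": 0.05, "198.51.100.7": 0.95}

# A few SpamAssassin-style regex rules, each contributing to a spam score.
RULES = [(re.compile(r"cheap meds|viagra", re.I), 3.0),
         (re.compile(r"100% free", re.I), 2.0)]

def classify(sender_ip, body, content_model=None):
    # Stage 1: a cheap reputation lookup rejects much of the spam early.
    if IP_REPUTATION.get(sender_ip, 0.5) < 0.1:
        return "spam"
    # Stage 2: rule/regex scoring in the style of SpamAssassin.
    score = sum(w for pattern, w in RULES if pattern.search(body))
    if score >= 3.0:
        return "spam"
    # Stage 3: the expensive content model runs only on what remains.
    if content_model is not None and content_model(body) > 0.5:
        return "spam"
    return "ham"

print(classify("203.0.113.9", "hi"))                 # spam (bad IP)
print(classify("198.51.100.7", "Cheap meds here!"))  # spam (rule hit)
print(classify("198.51.100.7", "Lunch tomorrow?"))   # ham
```

The point of the staging is economics: each later stage is more expensive per message but sees far fewer messages.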
Interestingly, while most researchers pair a simple representation (bag of words) with complex learning methods such as Support Vector Machines (SVMs), Proofpoint does the opposite: a complex representation, including context markers, AND/OR rules, phrases, and words, paired with a very simple classifier and learning method. Logistic regression assigns positive and negative weights to terms; SVMs, random forests, ensembles, and so on don't offer much improvement. R² is about 0.95, so there is no need for better accuracy. This is because Proofpoint's representation is sophisticated and already incorporates attribute interactions, conditional dependencies, and so on. About 1.3 million features go into the logistic regression.
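A tiny sketch of this design, assuming plain stochastic gradient descent on the logistic loss: the representation carries the complexity (words, phrases, and a hand-written AND-rule as features), while the learner is just logistic regression. The corpus, rule, and hyperparameters are invented; Proofpoint's actual lexicon and training procedure are far larger.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def featurize(message):
    """Rich representation: words, bigram phrases, and one hand-written
    AND-rule, all as binary features."""
    words = message.lower().split()
    feats = {f"word:{w}": 1.0 for w in words}
    feats.update({f"phrase:{a} {b}": 1.0 for a, b in zip(words, words[1:])})
    if "free" in words and "click" in words:          # AND-rule feature
        feats["rule:free_and_click"] = 1.0
    return feats

def train(examples, epochs=50, lr=0.5):
    """Plain stochastic gradient descent on the logistic loss."""
    w = {}
    for _ in range(epochs):
        for feats, y in examples:
            p = sigmoid(sum(w.get(f, 0.0) * v for f, v in feats.items()))
            for f, v in feats.items():
                w[f] = w.get(f, 0.0) + lr * (y - p) * v
    return w

data = [(featurize("free prize click here"), 1),
        (featurize("click free offer now"), 1),
        (featurize("lunch at noon tomorrow"), 0),
        (featurize("meeting notes attached"), 0)]
w = train(data)
p = sigmoid(sum(w.get(f, 0.0) for f in featurize("free click")))
print(round(p, 2))  # close to 1: spammy words and the AND-rule all fire
```

Because interactions like "free AND click" are encoded directly as features, the linear model has no need to discover them, which is why fancier learners add little.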
A major advantage of a simple classifier such as logistic regression is that it is comprehensible and transparent. A "white box" is better for this application than opaque "black box" approaches: when a message is classified as spam, an analyst can run it through the classifier and show the terms that influenced the classification, their weights, and so on.
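With a linear model, that explanation is just a listing of each fired term and its weight. The weights below are invented for illustration.

```python
# Hypothetical learned weights: positive pushes toward spam, negative
# toward ham.
WEIGHTS = {"word:viagra": 4.2, "phrase:act now": 1.8,
           "word:meeting": -2.5, "word:unsubscribe": 0.9}

def explain(features):
    """Return (term, weight) contributions, largest magnitude first."""
    hits = [(f, WEIGHTS[f]) for f in features if f in WEIGHTS]
    return sorted(hits, key=lambda fw: -abs(fw[1]))

for term, weight in explain(["word:viagra", "phrase:act now", "word:the"]):
    print(f"{term:16s} {weight:+.1f}")
```

An analyst reading this listing can see at a glance why a verdict was reached, which a kernelized SVM or a forest of trees cannot offer so directly.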
Proofpoint reparses its corpus every 24 hours and updates the classifier in about 1.5 hours. Examples are aged out of the corpus so that the model can change as spam changes over time. To respond quickly to spam attacks, they also have a special "fast attack learning and response" system that adds new terms to the model every fifteen minutes. Usually it is the representation that is wrong rather than the model, so there is no need to rebuild the model from scratch; it suffices to improve the representation by adding terms.
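With a linear model, "adding terms" can be as simple as extending the weight table, which is why the fast-attack loop can run every fifteen minutes instead of waiting for a full retrain. The campaign terms and provisional weights below are invented for illustration.

```python
# Sketch of fast-attack response: when a new campaign slips through,
# push provisional weights for its distinguishing terms instead of
# rebuilding the whole model.
weights = {"word:lottery": 3.0, "word:meeting": -2.0}

def score(terms, w):
    return sum(w.get(t, 0.0) for t in terms)

campaign = ["word:crypto", "word:giveaway"]   # the new attack's terms
print(score(campaign, weights))               # 0.0 -- slips through

# The fast-attack loop adds the terms with provisional positive weights.
weights.update({t: 2.5 for t in campaign})
print(score(campaign, weights))               # 5.0 -- now caught
```

The provisional weights can later be refined by the regular 24-hour retrain once labeled examples of the campaign enter the corpus.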
Proofpoint does well in competitions against its competitors in part because it does text analysis, whereas many competitors only do IP and domain filtering. In an ongoing effort to stay ahead of spammers, Proofpoint is now looking at link mining and network analysis as potential sources of information to support spam identification.