I attended a security panel recently where the panelists were asked what areas or approaches in security, if any, were over-hyped. One of the panelists, the CISO of a well-regarded Valley startup, said “Machine Learning. If a vendor comes in and starts talking about how they use Machine Learning, I start tuning out.”
Not long after that, I was briefing a well-known analyst about Agari’s new Enterprise Protect product. As I described our approach, which leverages Machine Learning, he said “It seems like you’re boiling the ocean. Why not just use a simple set of rules?”
I believe that both the CISO and the analyst were reacting to the overuse, and sometimes misuse, of the term “Machine Learning” in marketing security solutions. Many vendors use Machine Learning, or the broader field of Data Science, to attempt to associate a degree of complexity and sophistication with their products, sometimes leveraging jargon as a way of obscuring the constraints or limitations of their approaches.
While the underlying techniques associated with Machine Learning can be complex, the primary goal is relatively simple – to use a set of known examples (e.g. identified attacks) to train a generalized model that can estimate a value or a verdict for previously unknown examples (e.g. new attacks). Earlier this year, I delivered a presentation to the eCrime Congress in London that focused on demystifying the application of Machine Learning (ML) in security solutions and helping security buyers make informed decisions.
The first part of the talk focused on the core components of any (Supervised) Machine Learning process:
- Features – A set of identifying characteristics that best represent what you’re trying to model, or what you’re trying to differentiate between. If you’re looking at network traffic, these features may include throughput, port usage stability, average packet size and potentially many others. The selection of features for Machine Learning, and their ability to separate the different classes (e.g. legitimate vs. malicious traffic), is a critical part of the ML process.
- Training Set – A collection (or corpus) of examples of the different classes you want to identify. Traditional anti-spam solutions, for example, would have large bodies of “spam” (bad email) and “ham” (good email) to use for training. A key requirement for most ML algorithms is that this training set be “labeled” – every example must be correctly classified so that the algorithm can learn to classify correctly from these examples.
- Algorithm – A statistical technique for building a generalized model from the training examples. The model generated can essentially be thought of as a formula that takes the features of any single example and generates a verdict, class or value for that example (e.g. is it a legitimate email message or a malicious one?). There are a large and growing variety of ML algorithms, each with its own combination of tuning complexity, operational cost and ability to recognize patterns.
- Accuracy – A measure of how good the model is at differentiating between the classes you are looking to identify. Accuracy is generally measured on a testing set or in real-world situations and, while there are multiple measures of accuracy, False Positive (i.e. a false alarm, in a security context) and False Negative (i.e. an attack that made it through) rates are commonly used.
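To make these components concrete, here is a minimal, self-contained sketch in Python. The dataset, feature names and numbers are entirely made up for illustration, and the algorithm (a simple nearest-centroid classifier) stands in for whatever technique a real product would use; the point is only to show Features, Training Set, Algorithm and Accuracy as distinct pieces.

```python
# Features: each "message" is represented by two made-up features,
# (number of links, number of urgent-sounding words).
# Training Set: labeled examples -- label 1 = malicious, 0 = legitimate.
training_set = [
    ((8.0, 5.0), 1),
    ((7.0, 4.0), 1),
    ((1.0, 0.0), 0),
    ((0.0, 1.0), 0),
]

# Algorithm: nearest-centroid -- average the feature vectors of each
# class, then assign a new example to the class with the closest average.
def train(examples):
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(sum(vals) / len(vals) for vals in zip(*rows))
        for label, rows in by_label.items()
    }

def predict(model, features):
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda label: dist(model[label], features))

# Accuracy: measured on a held-out testing set, never the training data.
testing_set = [((9.0, 6.0), 1), ((0.0, 0.0), 0), ((6.0, 5.0), 1)]
model = train(training_set)
correct = sum(predict(model, feats) == label for feats, label in testing_set)
print(f"accuracy: {correct}/{len(testing_set)}")
```

Swapping in a different algorithm, richer features or a larger training set changes only one component at a time, which is exactly how the framework is meant to be used when evaluating a vendor’s claims.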
The traditional Supervised Machine Learning process involves selecting features, collecting a labeled training set, choosing and tuning an algorithm to train a model, and measuring the resultant accuracy on a testing set. If the accuracy measures are acceptable, then you have a model that can be put into the wild. If not, you may have to select more or different features, collect better training examples, or tune or replace your algorithm, repeating the process until the accuracy measures meet your expectations.
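The two error rates used to judge that accuracy can be computed in a few lines. The verdict/actual pairs below are invented for illustration (1 = malicious, 0 = legitimate); a real evaluation would use a properly held-out testing set.

```python
# Each pair is (model verdict, actual label); 1 = malicious, 0 = legitimate.
# These results are made up purely to illustrate the two rates.
results = [(1, 1), (0, 0), (1, 0), (0, 1), (0, 0), (1, 1)]

# False Positive: a legitimate example flagged as malicious (a false alarm).
false_positives = sum(1 for verdict, actual in results if verdict == 1 and actual == 0)
# False Negative: a malicious example allowed through (a missed attack).
false_negatives = sum(1 for verdict, actual in results if verdict == 0 and actual == 1)

legitimate = sum(1 for _, actual in results if actual == 0)
malicious = sum(1 for _, actual in results if actual == 1)

fp_rate = false_positives / legitimate  # false alarms per legitimate example
fn_rate = false_negatives / malicious   # missed attacks per malicious example
print(f"FP rate: {fp_rate:.2f}, FN rate: {fn_rate:.2f}")
```

Whether a given FP/FN trade-off is “acceptable” depends on the deployment: a false alarm that quarantines a legitimate business email has a very different cost than a missed attack.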
While the art of Machine Learning (and the occasional perception of “voodoo” science that results) lies in the selection and tuning within this iterative process, the underlying approach is straightforward and consistent. The core components – Features, Training Set, Algorithm and Accuracy – can be used to understand, analyze and evaluate any security solution based on Machine Learning.
Next week, I’ll post a follow-up blog on how to apply this framework for evaluating Machine Learning claims to make better, more informed decisions regarding security solutions.
If you’re interested in learning more, don’t miss our webinar on June 23rd: Machine Learning in Security: Detecting Signal in the Vendor Noise.