Reading is like riding a bicycle: once you master it, it feels easy and automatic, and you quickly forget how much effort it took to learn. For example, we are normally not aware that we move our eyes 3 or 4 times per second as we read, glancing at each word on a screen or page for a few hundred milliseconds. Nor do we realize that only a portion of a word is visible and in focus at any given moment (1). Unfortunately, the speed and ease of reading can also be used against us during everyday tasks like reading email. In particular, a malicious actor can make a subtle change to an email address, transforming it into a lookalike domain that seems trustworthy and familiar when it actually may be the first step of a phishing attack.
Lookalike domains fool us by taking advantage of the fact that, in less than a second, we’ve already glanced at the subject line and the sender’s name and decided whether the sender’s email address is familiar. We thought we saw microsoft.com, but in fact it was rnicrosoft.com. Or perhaps we read the domain as apple.com, but something feels a little odd, and on second glance we realize it’s actually app1e.com. (Yes, look again: that “l” in the second apple is actually the number 1!)
Lookalike domains not only “hack into” the rapid and perhaps automatic way we read email, but they also take advantage of the halo effect, which is the tendency for positive feelings about a person, idea, or thing to “spread” or transfer to other aspects of experience (2). So when we think we see a familiar, trusted company or brand name in the email address, we’re more likely to assume that the message itself (its content, links, and attachments) can be trusted. We lower our defenses and click on the message, and if it is in fact a carefully engineered phishing message, we’ve taken the bait.
Lookalike domains (let’s call them lookalikes) can be divided into two categories. The first consists of domains created by modifying a known domain with letter substitutions, additions, or subtractions. The second takes the real domain (or, optionally, a modified version of it) and embeds it in a larger string, such as bestdeals-amazon.com, or combines embedding with modification, as in bestdeals-amazan.com.
In this article, we focus on the first type of lookalike, which is far more subtle and often more challenging to catch. In the next section, we discuss three specific challenges.
The first challenge is identifying the target of a lookalike. It’s fairly trivial for a person to recognize that an email address from the domain amazan.com is spoofing or targeting the domain amazon.com, once that third “a” is spotted. Building an automated system that can identify the target domain is far from trivial. In particular, asking a machine to tell us, “Which domain does amazan.com look like?” from a universe of possibilities is computationally expensive (e.g., do we search through a giant list of known domains?). In other words, identifying the target of the lookalike automatically and efficiently is a difficult task.
There are a number of potential solutions. One option, as we hinted above, is to create a “dictionary” of known good domains, and to exhaustively compare the candidate lookalike to each entry in our domain dictionary (we can of course optimize this brute-force search in a number of ways). The advantage of the dictionary look-up approach is that, once we have a target, we can implement a fast and lightweight method for comparing the candidate and target domains. The cost, on the other hand, is the time and effort spent searching for potential targets to compare against.
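To make the dictionary look-up idea concrete, here is a minimal sketch of the brute-force search. It assumes a tiny, hypothetical list of known good domains, and it uses the standard library's `difflib.get_close_matches` as a stand-in similarity measure; the function name `find_target` and the cutoff of 0.8 are illustrative choices, not part of any production system.

```python
import difflib

# Hypothetical dictionary of known good domains (illustrative only).
KNOWN_DOMAINS = ["amazon.com", "microsoft.com", "apple.com", "adobe.com"]

def find_target(candidate, known=KNOWN_DOMAINS, cutoff=0.8):
    """Return the known domain most similar to `candidate`, or None."""
    if candidate in known:
        return None  # exact match: the real domain, not a lookalike
    # Compares the candidate against every entry in the dictionary.
    matches = difflib.get_close_matches(candidate, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(find_target("amazan.com"))      # amazon.com
print(find_target("rnicrosoft.com"))  # microsoft.com
print(find_target("example.org"))     # None
```

Every look-up scans the whole list, which is exactly the search cost described above. With a realistic dictionary of many thousands of domains, that scan is where the time goes.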
Alternatively, we can construct a method that maps all our known, good domains into a manageable “metric space.” Then, given a candidate lookalike domain, we need only map the candidate into the same space, which gives us for free the targets that are nearest to the candidate. In other words, the advantage of this second strategy is that our mapping method generates one or more targets for a given candidate lookalike at virtually no cost. The downside, however, is the time and effort spent training a machine-learning model that performs this mapping.
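The metric-space idea can be sketched in a few lines. The sketch below is only an illustration: in place of a trained model, it uses a hand-written stand-in embedding (counts of character bigrams), and the names `embed`, `INDEX`, and `nearest_targets` are hypothetical. The point is the shape of the computation: known domains are mapped once, and a candidate is mapped into the same space and compared by distance.

```python
import math
from collections import Counter

# Hypothetical list of known good domains (illustrative only).
KNOWN_DOMAINS = ["amazon.com", "adobe.com", "apple.com", "microsoft.com"]

def embed(domain):
    """Stand-in embedding: counts of overlapping character bigrams.
    A trained model would replace this function."""
    return Counter(domain[i:i + 2] for i in range(len(domain) - 1))

def euclidean(u, v):
    """Euclidean distance between two sparse count vectors."""
    return math.sqrt(sum((u[k] - v[k]) ** 2 for k in u.keys() | v.keys()))

# Known domains are mapped into the space once, ahead of time.
INDEX = {d: embed(d) for d in KNOWN_DOMAINS}

def nearest_targets(candidate, k=1):
    """Return the k known domains closest to the candidate."""
    query = embed(candidate)
    return sorted(INDEX, key=lambda d: euclidean(query, INDEX[d]))[:k]

print(nearest_targets("amazan.com"))  # ['amazon.com']
```

This toy version still sorts the whole index on each query. The "virtually no cost" look-up described above would come from a proper nearest-neighbor index or from clustering the space in advance.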
So now we have our lookalike and its intended target in hand. The second challenge is defining a similarity metric. Of the following lookalike candidates, which one looks most like (is most similar to) amazon.com:
amazan.com or arnazon.com or amazn.com
One approach to this question is to use edit-distance as our similarity metric. Similarity is defined as the cost (i.e., the type and number of edits) of transforming one word into another. “Lower cost” means “more similar.” For example, how many letters do we have to add/remove/substitute to go from:
amazan.com to amazon.com
The appeal of approaches like measuring edit-distance, which operate directly on the letters, is that they are fast and easy to compute. Thus, they complement expensive target-search methods, like the dictionary look-up strategy we mentioned earlier: a low-cost similarity metric, but a high-cost search method for identifying targets.
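As a worked example, the sketch below computes the classic Levenshtein distance, in which every insertion, deletion, or substitution costs 1, for the three candidates above. The unit costs are an assumption made for illustration; other edit-distance variants weight the edits differently.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

for candidate in ["amazan.com", "arnazon.com", "amazn.com"]:
    print(candidate, edit_distance(candidate, "amazon.com"))
# amazan.com 1
# arnazon.com 2
# amazn.com 1
```

Note that arnazon.com scores as less similar than amazan.com, even though "rn" is easy to misread as "m". Plain edit-distance counts edits and knows nothing about how characters look.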
Edit-distance methods have two additional important features. First, they tend to use “hand-coded” rules, not only for determining which letters (or more accurately, typewritten characters) resemble each other, but also for setting the relative cost of each edit (e.g., how much does one addition and two substitutions cost?). A second feature is that edit-distance methods are well-suited to smaller character sets, like the Latin alphabet. Once we include the larger universe of Unicode characters, creating sets of character-level lookalikes (i.e., homoglyphs or “confusables”) becomes more challenging.
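To show what hand-coded rules look like in practice, here is a sketch of a weighted edit-distance in which substitutions between visually similar characters cost less than other edits. The confusable pairs and their costs in `CONFUSABLE_COST` are made up for illustration; choosing and maintaining such a table is precisely the hand-coding burden described above.

```python
# Hypothetical hand-coded costs for visually confusable pairs (illustrative).
CONFUSABLE_COST = {
    frozenset("l1"): 0.2,
    frozenset("o0"): 0.2,
    frozenset("ao"): 0.5,
}

def substitution_cost(x, y):
    """Cost of replacing x with y: 0 if equal, reduced if confusable."""
    if x == y:
        return 0.0
    return CONFUSABLE_COST.get(frozenset((x, y)), 1.0)

def weighted_distance(a, b, indel_cost=1.0):
    """Edit distance with per-pair substitution costs."""
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + indel_cost,                     # delete ca
                curr[j - 1] + indel_cost,                 # insert cb
                prev[j - 1] + substitution_cost(ca, cb),  # substitute or keep
            ))
        prev = curr
    return prev[-1]

print(weighted_distance("app1e.com", "apple.com"))  # 0.2
print(weighted_distance("appxe.com", "apple.com"))  # 1.0
```

With this table, app1e.com is far closer to apple.com than appxe.com is, although both differ by a single substitution. Extending the table to the full range of Unicode confusables is where the approach becomes hard to maintain by hand.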
A very different approach, which we are developing at Agari, uses image-based similarity. The image-based approach takes its inspiration from the field of reading research, and in particular, from the physiological and psychological mechanisms that support skilled reading. Instead of treating domains as strings of individual characters, we convert each string into an image.
From this perspective, individual letters, numbers, and punctuation are no longer the starting point. Rather, we begin with domains as 2D arrays of pixels. Similarity is then, in a sense, how our eyes “naturally” see it (e.g., how a tiger and a lion look similar, or an apple and a pear). The challenge, of course, is building a model that learns to “see” that the image of a lookalike domain resembles the image of its target.
The image-based approach has several advantages over edit-distance methods. First, we don’t have to arbitrarily hand-code rules for which characters “look like” other similar ones, or for measuring how similar one character is to another. Instead, character features (like the fact that a “p” and a “q” have a loop and a vertical line) emerge without explicitly teaching the model to detect them. Second, there are a variety of well-defined, well-understood distance metrics that measure how far apart two images are, that is, how similar or different they are.
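The idea of measuring distance between images can be illustrated with a toy example. The sketch below is not a trained model: it draws three characters with a made-up 3-by-5 bitmap font (`GLYPHS` is entirely hypothetical), renders a string as a 2D array of 0/1 pixels, and compares two images with the Euclidean distance over raw pixels.

```python
import math

# Made-up 3x5 bitmap font for three characters (illustrative only).
GLYPHS = {
    "l": [".#.", ".#.", ".#.", ".#.", ".#."],
    "1": [".#.", "##.", ".#.", ".#.", ".#."],
    "o": ["...", "###", "#.#", "#.#", "###"],
}

def render(text):
    """Render text as a 2D list of 0/1 pixels, glyphs side by side."""
    rows = [[] for _ in range(5)]
    for ch in text:
        for r, line in enumerate(GLYPHS[ch]):
            rows[r].extend(1 if px == "#" else 0 for px in line)
    return rows

def image_distance(img_a, img_b):
    """Euclidean distance between two equally sized pixel arrays."""
    return math.sqrt(sum((pa - pb) ** 2
                         for row_a, row_b in zip(img_a, img_b)
                         for pa, pb in zip(row_a, row_b)))

print(image_distance(render("l"), render("1")))  # 1.0
print(image_distance(render("l"), render("o")))  # about 3.32
```

Even with raw pixels, "l" and "1" come out much closer than "l" and "o", and no rule saying that "l" resembles "1" was ever written down. A learned model replaces the raw pixels with features that are more robust to shifts, fonts, and string length.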
The trade-off then is: image-based methods provide a quick and powerful way to measure the similarity between a lookalike and its target domain, and more importantly, they also generate one or more potential targets that “look like” a given candidate lookalike, without requiring brute-force search. The catch is that building, training, and validating a model that can do this takes time and effort. In the final section, we briefly describe how that model-building process works.
The third challenge is demonstrating malicious intent. Just because an email arrives from a domain that looks an awful lot like amazon.com or microsoft.com does not prove it’s malicious; we need more evidence than close similarity alone. Addressing that challenge is beyond the scope of this article, but some potential questions to ask are: Is the lookalike domain registered, and if so, how old is it? What do we know about the history of email from this domain? What other message features (e.g., the infrastructure used to deliver it, the subject line, or the intended recipient) suggest malicious intent?
In the final section, we provide an end-to-end overview of implementing an image-based model for detecting lookalike domains. The process is divided into five steps: (1) building an image library, (2) training a “bottleneck” model, (3) creating a method for identifying targets, (4) measuring lookalike-target similarity, and finally, (5) identifying malicious lookalikes.
Fig 1. Well-known domains are converted from strings into 2D images.
Fig 2. The well-known domain images are used to train a bottleneck model (in this case, we use a convolutional autoencoder network to illustrate the process).
Fig 3. The compressed feature vectors from the well-known domains are projected into a low-dimensional metric space and grouped with an unsupervised learning algorithm. In the simplified example, the lookalike candidate amazan.com is nearest to the cluster that contains amazon.com and adobe.com.
Fig 4. The distance between the lookalike and two targets is measured, and if the closest target is within the threshold, the lookalike is labeled a match.
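The thresholding step in Fig 4 can be sketched as follows. The 2D coordinates below are invented for illustration and are not output from any model; they simply stand in for projected feature vectors. The function name `label_lookalike` and the threshold of 0.5 are likewise hypothetical.

```python
import math

# Made-up 2D coordinates standing in for projected feature vectors.
TARGETS = {"amazon.com": (1.0, 1.0), "adobe.com": (2.5, 1.5)}

def label_lookalike(point, targets=TARGETS, threshold=0.5):
    """Return (closest target, distance) if within threshold,
    otherwise (None, distance to the closest target)."""
    name, dist = min(
        ((n, math.dist(point, p)) for n, p in targets.items()),
        key=lambda pair: pair[1],
    )
    return (name, dist) if dist <= threshold else (None, dist)

# A hypothetical candidate that lands close to amazon.com is a match.
print(label_lookalike((1.2, 1.1))[0])  # amazon.com
# A candidate far from every target is not labeled a match.
print(label_lookalike((5.0, 5.0))[0])  # None
```

The threshold controls the trade-off between missed lookalikes and false matches, so in practice it would be tuned on labeled examples rather than fixed by hand.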