The key to any identity-based attack is impersonation—manipulating components of an email message to exactly match or bear similarity to identity markers in a real message sent from a trusted identity. The most common message components that have these identity markers are the From header, the Subject header, and the body of the message.
Of these, the From header is the most commonly recognized identity marker, as it is displayed prominently in most email clients. It is also the identity marker that is most commonly abused since the sender of a message can specify any value for it. The Subject header and body can contain identity markers, such as words, phrases, brand names, logos, URLs and narrative structures, but these are often secondary to those in the From header and primarily serve to support, rather than define, the perceived sending identity for a message.
The From header is generally made up of two parts: a display name, which is the suggested display label for an email client, and an email address, which itself has a local part and a domain. For example, the From header “Bo Bigboss” <hackyjoe666@gmail.com> has a display name of “Bo Bigboss,” a local part of “hackyjoe666,” and a domain of “gmail.com.”
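As a minimal sketch of this structure, the From header can be split into its three parts with Python's standard library. The helper name split_from_header is hypothetical and is used here only for illustration.

```python
from email.utils import parseaddr


def split_from_header(value):
    """Split a From header into (display name, local part, domain)."""
    display_name, address = parseaddr(value)
    # The domain is everything after the last "@" in the address.
    local_part, _, domain = address.rpartition("@")
    return display_name, local_part, domain.lower()


parts = split_from_header('"Bo Bigboss" <hackyjoe666@gmail.com>')
print(parts)
```

Only the display name is guaranteed to be shown by an email client, which is why the other two parts deserve separate inspection.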
Since, as shown below, many email clients show only the display name in certain views, Display Name Attacks are the most common form of identity deception. Attackers often insert the identity of a trusted individual (such as the name of an executive of the targeted company) or a trusted brand (such as the name of the bank used by the targeted individual) into the display name. Since common consumer mailbox services such as Gmail and Yahoo allow a user to specify any value in the display name, this type of attack is simple and cheap to stage from such a service.
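One simple way to illustrate a Display Name Attack is to compare the display name against the addresses a trusted identity is known to use. The sketch below assumes a small, hypothetical directory (KNOWN_ADDRESSES) and a hypothetical helper name; it is not how any particular product implements detection.

```python
from email.utils import parseaddr

# Hypothetical directory: trusted display names and the addresses they send from.
KNOWN_ADDRESSES = {
    "bo bigboss": {"bo.bigboss@example.com"},
}


def is_display_name_mismatch(from_header):
    """Return True when a trusted display name arrives from an unknown address."""
    name, address = parseaddr(from_header)
    normalized = " ".join(name.lower().split())
    known = KNOWN_ADDRESSES.get(normalized)
    if known is None:
        # The display name does not claim a trusted identity.
        return False
    return address.lower() not in known


suspicious = is_display_name_mismatch('"Bo Bigboss" <hackyjoe666@gmail.com>')
legitimate = is_display_name_mismatch('"Bo Bigboss" <bo.bigboss@example.com>')
```

An exact-match lookup like this misses near-variants of a name, so it should be read as an illustration of the idea rather than a complete check.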
In addition to manipulating the display name, an attacker may also use the actual email address of the impersonated identity in the From header, such as “United Customer Service” <email@example.com>. This type of attack, known as a Domain Spoofing Attack, does not require compromising the account or the servers of the impersonated identity, but exploits the security holes in the underlying email protocols. Attackers often use public cloud infrastructure or third-party email sending services that do not verify domain ownership to send such attacks. Email authentication standards, such as DMARC, can be used by a domain owner to prevent spoofing of their domain, but are still not adopted widely by popular brands and government organizations.
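For context, a domain owner publishes a DMARC policy as a DNS TXT record at the _dmarc subdomain of the protected domain, written as semicolon-separated tag=value pairs. The sketch below parses such a record; the record text and the reporting address are illustrative, and the function name parse_dmarc_record is hypothetical.

```python
def parse_dmarc_record(txt):
    """Parse a DMARC TXT record into a dictionary of tag -> value."""
    tags = {}
    for part in txt.split(";"):
        if "=" not in part:
            continue  # skip empty segments such as a trailing ";"
        key, _, value = part.partition("=")
        tags[key.strip().lower()] = value.strip()
    return tags


# Illustrative record, as it might be published at _dmarc.example.com.
record = parse_dmarc_record(
    "v=DMARC1; p=reject; rua=mailto:dmarc-reports@example.com"
)
policy = record["p"]
```

A policy of reject asks receivers to refuse messages that fail authentication for the domain, while none only requests reports.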
In cases where a domain is protected by email authentication and domain spoofing is not possible, attackers try to deceive the recipient by registering and using domains that are similar to the impersonated domain. These types of attacks, known as Look-alike Domain Attacks, often use homoglyphs, or characters that appear similar to the original characters in the impersonated domain. Attackers can use rendering similarities, such as “PayPal” <firstname.lastname@example.org>, exploiting the specific fonts and rendering styles used in popular email clients. Another variation of the Look-alike Domain Attack is to add words to the domain name. For example, if an attacker wanted to send you a bogus invoice from Acme Corporation, whose domain might be acme.com, the attacker could simply register acme-payments.com or invoices-acme.com. Attackers can also use characters from another script in the Unicode set. Cyrillic is a common choice, as in the From header “Dropbox” <notifications@dropbox.com>, where the “o”s in the domain are actually Cyrillic characters, but an email client will render a version that looks exactly like the impersonated domain.
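Two of these look-alike variations, foreign-script homoglyphs and added words, can be sketched as follows. The confusables table covers only a handful of Cyrillic letters and the label handling is deliberately naive; both are assumptions made for illustration, and the function names are hypothetical.

```python
import unicodedata

# Small, illustrative subset of Cyrillic letters that resemble Latin ones.
CONFUSABLES = {"\u043e": "o", "\u0430": "a", "\u0435": "e", "\u0440": "p", "\u0441": "c"}


def scripts_in(label):
    """Return the Unicode scripts used by the letters in a label."""
    return {unicodedata.name(ch).split(" ")[0] for ch in label if ch.isalpha()}


def fold_homoglyphs(domain):
    """Map known look-alike characters back to their Latin counterparts."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in domain.lower())


def classify_domain(domain, protected):
    """Classify a sending domain relative to a protected domain."""
    domain = domain.lower()
    if domain == protected:
        return "exact"
    if fold_homoglyphs(domain) == protected:
        return "homoglyph"
    brand = protected.split(".")[0]
    labels = domain.split(".")
    # Naive choice of the registered label; real code would use a suffix list.
    registered = labels[-2] if len(labels) >= 2 else labels[0]
    if brand in registered.split("-"):
        return "added-word"
    return "unrelated"


spoofed = "dr\u043epb\u043ex.com"  # both "o" characters are Cyrillic
print(classify_domain(spoofed, "dropbox.com"))
print(classify_domain("acme-payments.com", "acme.com"))
```

Checking scripts_in on a label is a cheap first signal: a label that mixes Latin and Cyrillic letters is rarely legitimate.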
Finally, the most pernicious form of identity deception can take place when the attacker has compromised the email account or server of the identity they are impersonating. This type of attack, known as an Account Takeover Attack, while low in volume, is generally the hardest to detect, since it leverages the identity markers, infrastructure, and many of the behavioral characteristics of legitimate messages from that identity.
While the various forms of identity deception attacks may differ in prevalence and sophistication, they have some similarities. First, they manipulate the perception of the recipient, convincing them that the message was sent by an identity with which they are familiar. Second, they exploit the trust that the recipient has in that identity, convincing the recipient to take some action or disclose some information that they assume is safe. Security awareness and phishing training can help a recipient detect some of these attacks, but the burden of detection can’t fall to the individual as the quality and volume of identity deception increase. Instead, these attacks have to be detected and blocked by the next generation of cloud email security solutions.
To help prevent these types of attacks from reaching inboxes, Agari has developed an advanced threat prevention technology primarily focused on detecting sophisticated identity deception attacks. The Agari Identity Graph uses advanced machine learning techniques, Internet-scale email telemetry, and real-time data pipelines to detect and block such attacks with high efficacy.
The three phases of the Agari Identity Graph™ are:
Each of these phases leverages a variety of algorithms and machine learning models to come up with highly accurate answers to these questions. The results of these questions are combined to determine an overall score for a message. The final score represents the probability that the message can be trusted and can be used to take a policy-based action.
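To make the idea of a policy-based action concrete, the sketch below maps a final trust score, treated as a probability between 0 and 1, to an action. The threshold values and action names are hypothetical; the source does not specify them.

```python
def policy_action(trust_score, block_below=0.2, quarantine_below=0.5):
    """Map a trust probability to a policy action (thresholds are assumed)."""
    if not 0.0 <= trust_score <= 1.0:
        raise ValueError("trust_score must be a probability between 0 and 1")
    if trust_score < block_below:
        return "block"
    if trust_score < quarantine_below:
        return "quarantine"
    return "deliver"


actions = [policy_action(score) for score in (0.05, 0.3, 0.9)]
```

Keeping the thresholds as parameters lets each organization tune how aggressively low-trust messages are handled.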
The three phases are described in greater detail in the subsections below.
The goal of Identity Mapping is to determine one or more Identities that a recipient might perceive as sending a message. The Identity Mapping phase scans the message for identity markers and maps these to behavioral models that may exist for the perceived Identities. The perceived identities may represent individuals or organizations and, in fact, it is common for Identity Mapping to find multiple Identities that have a high likelihood of being perceived. Identity Mapping uses a variety of machine learning (ML) models and Natural Language Processing (NLP) techniques to extract the perceived identities from a message.
Consider the following message from Ravi Khatod, the CEO of Agari, to the rest of the company:
As the diagram shows, the identity markers within the message are tokenized and mapped to an Individual Identity for Patrick Peterson, and the Organizational Identity representing the company Agari.
While this example may seem relatively straightforward, consider the following two additional examples:
Both of these examples map to the same Individual Identity as the first example, even though the Google Drive message wasn’t sent directly from an account controlled by Patrick. The display name is the primary identity marker in all of these messages, but the email address and the subject can also influence the mapping.
All of these examples are variants of legitimate messages from Patrick that may be delivered to employees of Agari, but they could also represent attacks sent using his Individual Identity. As such, the output of the Identity Mapping phase in itself does not determine the trust or risk associated with a message, but all likely identities detected are mapped to behavioral models used by subsequent detection phases.
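The mapping step can be illustrated with a much simpler stand-in for the ML and NLP techniques described above: tokenize each identity marker and score keyword overlap against known identities, weighting the display name most heavily. The identity table, the weights, and the sample message are all hypothetical.

```python
import re

# Hypothetical identities and the keywords that point to them.
IDENTITIES = {
    "individual:patrick-peterson": {"patrick", "peterson"},
    "organization:agari": {"agari"},
}

# Assumed weights: the display name is the primary marker.
MARKER_WEIGHTS = {"display_name": 0.6, "address": 0.3, "subject": 0.1}


def tokenize(text):
    """Lowercase a marker and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def map_identities(markers):
    """Return candidate identities with a weighted keyword-overlap score."""
    scores = {}
    for identity, keywords in IDENTITIES.items():
        score = 0.0
        for field, weight in MARKER_WEIGHTS.items():
            tokens = tokenize(markers.get(field, ""))
            score += weight * len(keywords & tokens) / len(keywords)
        if score > 0:
            scores[identity] = round(score, 3)
    return scores


candidates = map_identities({
    "display_name": "Patrick Peterson (via Google Drive)",
    "address": "noreply@example.com",  # placeholder address
    "subject": "Agari roadmap - Invitation to edit",
})
```

The result lists every plausible identity rather than a single winner, and it says nothing yet about whether the message is trustworthy.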
Messages can also be mapped to organizational identities representing brands, as in the examples below:
In these cases, identity markers in the subject support the likelihood of the identity detected in the From header, disambiguating uncertainty in the identity markers in the display name. The Dropbox example demonstrates a common masking technique used by attackers—characters had to be mapped to their homoglyphs in the original script to determine the correct identity.
In these cases, the Agari Identity Graph™ maps messages to class identities, leveraging behavioral models that can be applied to larger classes of senders. For example, a message with a From header “Bank of Freedonia” <email@example.com> would be mapped to a Financial Services Class identity. Or a message from “Joe Smith” <firstname.lastname@example.org> would be mapped to a Consumer Mailbox Class identity.
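A rule-based sketch of class mapping might look like the following. The keyword set, the domain list, the addresses, and the function name are assumptions made for illustration; the class names follow the examples above.

```python
# Assumed lookup data; Gmail and Yahoo are the consumer services named earlier.
CONSUMER_MAILBOX_DOMAINS = {"gmail.com", "yahoo.com"}
FINANCIAL_KEYWORDS = {"bank", "credit", "savings"}  # hypothetical keyword list


def class_identity(display_name, address):
    """Map a sender to a class identity, or None when no class applies."""
    domain = address.rpartition("@")[2].lower()
    words = set(display_name.lower().split())
    if words & FINANCIAL_KEYWORDS:
        return "Financial Services Class"
    if domain in CONSUMER_MAILBOX_DOMAINS:
        return "Consumer Mailbox Class"
    return None


bank = class_identity("Bank of Freedonia", "service@bank.example")
consumer = class_identity("Joe Smith", "joe.smith@gmail.com")
```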
Together, the hierarchy of perceived Identities at the individual, organization, and class level is the starting point for the behavioral and trust phases of understanding message trust.
The behavioral model for an identity represents the expected sending behavior for that individual, organization, or class. It is an ML model trained specifically for that identity, based on legitimate message traffic previously seen by the Agari Identity Graph. This technique of “modeling the good” rather than trying to detect the pattern of the bad is a critical element of the efficacy of the Agari Identity Graph and the key difference from previous generations of email security detection.
A new message mapped to a specific identity is evaluated for anomalies relative to that identity’s behavioral model. All legitimate messages will also be used as training samples for subsequent refinement. Machine learning sub-models are used to compute certain classes of feature values and detect specific types of deception, so the scoring process uses a set of different models to determine a final trust score.
A new message is evaluated for anomalies relative to the expected behavior of the mapped identities. The diagram below shows that the expected behavior of messages coming from Patrick Peterson falls within a band of values across 100+ features falling into several classes. This band is determined by using all previously seen legitimate messages from Patrick as training samples.
The orange line in the diagram above represents the feature values for a single message. Each message is compared to the model, and deviations of feature values from the expected band are compiled into anomaly metrics. The features used for this type of anomaly detection are computed at both the global level, to determine how the mapped identity usually sends to the Internet as a whole, and at the local level, to determine how the mapped identity typically sends to the specific recipient. The overall risk associated with a message depends on computed differences for each mapped identity-based model and thresholds specific to the relationship, as described in the Trust Modeling section below.
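The band comparison can be sketched with basic statistics in place of a trained ML model: fit a per-feature band from legitimate history, then record how far a new message falls outside it. The feature names, the sample values, and the band width k are hypothetical.

```python
from statistics import mean, pstdev


def fit_band(history, k=2.0):
    """Build a per-feature (low, high) band from legitimate messages."""
    band = {}
    for feature in history[0]:
        values = [message[feature] for message in history]
        mu, sigma = mean(values), pstdev(values)
        band[feature] = (mu - k * sigma, mu + k * sigma)
    return band


def anomalies(message, band):
    """Return the distance outside the band for each deviating feature."""
    deviations = {}
    for feature, (low, high) in band.items():
        value = message[feature]
        if value < low:
            deviations[feature] = low - value
        elif value > high:
            deviations[feature] = value - high
    return deviations


# Hypothetical features taken from previously seen legitimate messages.
history = [
    {"send_hour": 9, "recipient_count": 3},
    {"send_hour": 10, "recipient_count": 4},
    {"send_hour": 11, "recipient_count": 5},
]
band = fit_band(history)
result = anomalies({"send_hour": 3, "recipient_count": 4}, band)
```

The same two functions could be fitted twice, once on an identity's global traffic and once on its traffic to a single recipient, to mirror the global and local levels described above.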
These changes are common, but are rarely without other balancing signals. For example, a new legitimate sending server is often either in the network neighborhood of other associated servers that send for the identity, or it is owned by a service that validates identity ownership. Heuristics are also applied to address the multitude of false positive and false negative cases that can occur in the wild.
As its last step, the Agari Identity Graph determines the relationship between the perceived identity and the recipient of the message so as to measure the trust that the recipient will attribute to the message. The chances that a recipient will open a message and be impacted by a malicious attack are significantly higher when the attack uses an identity that has a close or “highly engaged” relationship to the recipient, such as a CEO to CFO, than when it uses a further or “less engaged” relationship, such as a domain that has never previously sent an email to anyone in the organization. To account for this higher risk:
The following graphic depicts a portion of the Trust Model for Agari as a company and demonstrates two common and close relationships:
The first shows a relationship between two executives within Agari; the second shows the relationship between a recipient within Agari and other organizations that have a close relationship with Agari. Both of these are highly engaged relationships and any detected deception will result in low trust scores for the associated messages.
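One narrow aspect of this, that the same detected deception is penalized more heavily in a highly engaged relationship, can be sketched with relationship-specific tolerances. The engagement cutoff, the tolerance values, and the scoring formula are hypothetical and capture only this one idea.

```python
# Assumed tolerance for anomalies at each engagement level.
ANOMALY_TOLERANCE = {"high": 0.1, "low": 0.3, "none": 0.5}


def engagement_level(messages_exchanged):
    """Bucket a sender-recipient relationship by prior message volume."""
    if messages_exchanged == 0:
        return "none"
    return "high" if messages_exchanged >= 50 else "low"


def trust_score(anomaly, messages_exchanged):
    """Score trust in [0, 1]; closer relationships tolerate less anomaly."""
    tolerance = ANOMALY_TOLERANCE[engagement_level(messages_exchanged)]
    return max(0.0, 1.0 - anomaly / tolerance)


close = trust_score(0.2, messages_exchanged=200)   # e.g. executive to executive
distant = trust_score(0.2, messages_exchanged=5)   # occasional correspondent
```

With these assumed values, an identical anomaly produces a lower trust score for the close relationship than for the distant one.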
The sources of Trust Modeling features include:
The risk of the recipient opening the ransomware attachment from this new identity is lower than if it were to come from a more engaged relationship. However, the low engagement of the new identity, the presence of active code in a document-based attachment, and the sensitive role of the target recipient will lower the final score of the message so that it can be blocked.