Exploring the Ashley Madison Dataset

I first heard about the Ashley Madison breach on July 15, 2015, in a post by Brian Krebs. I immediately wondered what the fallout of such a breach would be. Would Ashley Madison’s new tagline be “1 million divorces and counting!” Would the perpetrators try to profit from the stolen data, perhaps through blackmail? I never imagined that I’d soon have the chance to explore the dataset myself, once Forbes published a link to the data.

Most breaches are not discovered immediately. If you were to look at the creation timestamp for members, you would find that, since January 1, between 25K and 30K new records were created each day, right up until February 23rd at 12:26:13. (The timezone is unknown but is likely either UTC or Toronto time.) Could this be the precise moment the data was pilfered? Or did the thieves merely stumble upon an older backup taken at that time? These are interesting questions, but I actually had another purpose in mind.

At my previous company I needed a large volume of realistic email messages in order to create some DKIM test cases. Fortunately I stumbled upon the Enron email corpus released as part of the public record after the company imploded. I wondered if perhaps the Ashley Madison data dump could be similarly useful.

One question I’ve had for a while is “What percentage of consumer email addresses are protected using the DMARC standard, broken down by country?” On the surface this seems like an easy question: if you have a hotmail.com, outlook.com, gmail.com, yahoo.com, aol.com, or comcast.net email address, you are protected by DMARC, and if you use some other service you are not. It’s actually more complicated than that: Yahoo manages email for Rogers, AT&T, SBC, and a number of other entities, and all 5 million+ Google Apps domains support DMARC reporting and enforcement. A year or two ago I wrote a script that takes a list of domains, performs an MX lookup on each one, and then determines whether the domain supports DMARC enforcement on inbound email. My script would get me part of the way there, but to do the country breakdown I was going to need a representative consumer email list that included each consumer’s country.
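The script itself isn’t shown here, but the idea behind it (look up a domain’s MX hosts, then check whether they belong to a mailbox provider that enforces DMARC on inbound mail) can be sketched in a few lines of Python. This is only an illustration, not the original script: the suffix list is a placeholder rather than a verified list of providers, the function names are made up for this example, and the MX lookup is passed in as a function so that any DNS library (or a canned table) can supply it.

```python
from typing import Callable, Iterable, Sequence

# ASSUMPTION: illustrative placeholder suffixes, not a verified provider list.
DMARC_MX_SUFFIXES = (
    "google.com",
    "googlemail.com",
    "yahoodns.net",
    "hotmail.com",
    "outlook.com",
)


def mx_matches(mx_host: str, suffixes: Sequence[str]) -> bool:
    """True if mx_host equals one of the suffixes or is a subdomain of one."""
    host = mx_host.rstrip(".").lower()
    return any(host == s or host.endswith("." + s) for s in suffixes)


def supports_dmarc(
    domain: str,
    lookup_mx: Callable[[str], Iterable[str]],
    suffixes: Sequence[str] = DMARC_MX_SUFFIXES,
) -> bool:
    """True if any MX host for the domain belongs to a listed provider.

    lookup_mx is a hypothetical hook: it takes a domain and returns its
    MX hostnames (an empty iterable if the domain has none).
    """
    return any(mx_matches(host, suffixes) for host in lookup_mx(domain))


if __name__ == "__main__":
    # Canned MX data standing in for real DNS answers.
    canned = {
        "apps-customer.example": ["aspmx.l.google.com."],
        "selfhosted.example": ["mail.selfhosted.example."],
    }
    for name in canned:
        print(name, supports_dmarc(name, lambda d: canned.get(d, [])))
```

Matching on the MX host rather than the address domain is what lets a Google Apps or Yahoo-hosted domain count as covered even though its name says nothing about the provider behind it.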

The recent Ashley Madison data dump includes all that and much more. Using the leaked data, I could theoretically figure out DMARC coverage by country, gender, or even sexual proclivities. For the purposes of this post, I’ll stick to a breakdown by country.

DMARC.org claims that typical DMARC coverage in the USA is around 85% for consumer mailboxes, and about 60% globally. Let’s see if that holds true in the Ashley Madison data.

After following the Forbes link using Tor, I obtained a .torrent file, which I used to download the dataset. After installing MySQL, I imported the data into relational tables and began my investigation. I extracted the domain portion of all email addresses and ran my shell script to determine DMARC coverage for each domain. After importing these results back into MySQL, I was finally able to perform my analysis.
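The domain-extraction step is simple enough to illustrate. The sketch below is in Python rather than the MySQL and shell tooling described above, and the helper names are hypothetical: it takes everything after the last “@”, lowercases it, and counts how many addresses share each domain, so that each distinct domain only has to be checked once.

```python
from collections import Counter
from typing import Iterable, Optional


def email_domain(address: str) -> Optional[str]:
    """Return the lowercased domain part of an address, or None if malformed."""
    local, sep, domain = address.strip().rpartition("@")
    if not sep or not local or not domain:
        return None
    return domain.lower().rstrip(".")


def domain_counts(addresses: Iterable[str]) -> Counter:
    """Count addresses per domain, skipping anything without a usable domain."""
    counts: Counter = Counter()
    for address in addresses:
        domain = email_domain(address)
        if domain is not None:
            counts[domain] += 1
    return counts


if __name__ == "__main__":
    sample = ["a@gmail.com", "b@GMAIL.com", "c@yahoo.com", "not-an-address"]
    for domain, n in domain_counts(sample).most_common():
        print(domain, n)
```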

In the Ashley Madison dataset, 90.68% of all email addresses linked to US-based members support the DMARC standard today. That’s more than 5 points higher than the DMARC.org estimate. Global coverage is 87.26%, which is much better than the DMARC.org estimate of 60%.
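Figures like these come from a straightforward aggregation: for each country, divide the number of addresses at a DMARC-covered domain by the total number of addresses. The analysis here was done in MySQL; the Python sketch below only illustrates the arithmetic, using made-up rows and hypothetical function names, and assumes each record has already been reduced to a (country, domain) pair.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def coverage_by_country(
    rows: Iterable[Tuple[str, str]], dmarc_domains: Set[str]
) -> Dict[str, float]:
    """Percentage of addresses per country whose domain is DMARC-covered."""
    totals: Dict[str, int] = defaultdict(int)
    covered: Dict[str, int] = defaultdict(int)
    for country, domain in rows:
        totals[country] += 1
        if domain in dmarc_domains:
            covered[country] += 1
    return {c: 100.0 * covered[c] / totals[c] for c in totals}


def overall_coverage(
    rows: Iterable[Tuple[str, str]], dmarc_domains: Set[str]
) -> float:
    """Percentage of all addresses whose domain is DMARC-covered."""
    rows = list(rows)
    if not rows:
        return 0.0
    hits = sum(1 for _, domain in rows if domain in dmarc_domains)
    return 100.0 * hits / len(rows)


if __name__ == "__main__":
    # Made-up sample rows, not taken from the dataset.
    rows = [("US", "gmail.com"), ("US", "example.org"), ("CA", "yahoo.com")]
    covered = {"gmail.com", "yahoo.com"}
    print(coverage_by_country(rows, covered))
    print(overall_coverage(rows, covered))
```

Note that the percentages are weighted by address, not by domain, so a single large provider can dominate a country’s figure.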

Below is a breakdown by country, showing the percentage of Ashley Madison member addresses hosted at a DMARC-compliant provider:

That’s all for today, but there are a lot more juicy stories hidden in this data… stay tuned!