Real-Time C2 Blocking via Random Domain Name Identification

With credit to Brandon Niemczyk
HP TippingPoint DVLabs

 

Many pieces of malware use automated domain generation to increase the robustness of their networks and to avoid blacklisting. This places a much heavier burden on any party trying to block them via DNS blacklists.

 

Why can AV vendors not just reverse the algorithms and sink-hole all the domains that will be generated?

The traditional method of reversing the domain generation algorithm and pre-determining the domains to block has two major pitfalls:

 

  1. The time it takes to reverse engineer an algorithm is longer than the time it takes to write a new one, leading to an arms race the good guys can never win.
  2. Domain generation algorithms may include seed information unknown at the time of writing and reversing, such as the 4th most popular YouTube video on the day of generation. Even if such an algorithm is fully understood and reversed, there is still no way to know ahead of time what domains it will generate.

0-day C2 coverage

Using a statistical approach allows domains to be discovered and blocked in real time, without any prior knowledge of the malware, protecting patient zero.

 

Method Overview

We want to look at a domain name and automatically decide whether it is meant for humans to read or is likely generated by an algorithm. Intuitively, this is easy for a human to do, but it requires a bit more thought to automate. So we will leverage old-school cipher-analysis techniques, but instead of posing the question, “Does this key applied to this text produce valid output?” we will ask, “Is this string likely to be human readable?”

 

We are going to need to select some data points to analyze. The simplest solution is to associate with every character (a 1-gram) and every character pair (a 2-gram) a probability of showing up in English. Then, by analyzing the probability of a given string being generated, we can decide whether it is likely to be English. The upside of this approach is that the probabilities are easy to generate and require no understanding of the semantics of the target language. In fact, given a string, we just take all the probabilities for its 1-grams and 2-grams and compute the mean of each type. Those two numbers can be used to determine how likely a string is to be in our target language.
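To make this concrete, here is a minimal Python sketch (an illustration, not our production code) that builds the 1-gram and 2-gram tables from a training word list and computes the two means for a string:

```python
from collections import Counter
from statistics import mean

def train_ngrams(words):
    """Estimate 1-gram and 2-gram probabilities from a list of training words."""
    c1, c2 = Counter(), Counter()
    for w in words:
        w = "".join(ch for ch in w.lower() if ch.isalpha())
        c1.update(w)                                   # single characters
        c2.update(a + b for a, b in zip(w, w[1:]))     # adjacent character pairs
    n1, n2 = sum(c1.values()), sum(c2.values())
    return ({g: c / n1 for g, c in c1.items()},
            {g: c / n2 for g, c in c2.items()})

def ngram_means(s, p1, p2, floor=1e-6):
    """Return (mf, mtp): the mean 1-gram and mean 2-gram probability of s."""
    s = "".join(ch for ch in s.lower() if ch.isalpha())
    if not s:
        return floor, floor
    mf = mean(p1.get(ch, floor) for ch in s)
    pairs = [a + b for a, b in zip(s, s[1:])]
    mtp = mean(p2.get(g, floor) for g in pairs) if pairs else floor
    return mf, mtp
```

With tables trained on English domains, ngram_means("facebook", p1, p2) should land near the English cluster, while a string like "jfkdslvjiewnfi0e2jlfjksl" should not.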

 

To understand why these are useful data points, let’s take a look at histograms of the means for datasets of both English and randomly generated domains.

 

In the following histograms, the x-axis is the mean probability of occurrence in English and the y-axis is the number of occurrences in the training data. As an illustration, a random 2nd level domain may be "jfkdslvjiewnfi0e2jlfjksl" and an English 2nd level domain may be "facebook."

 

2-Grams (example: AA - ZZ)

  • English domain 2-gram mean distribution [figure: mtp_valid.png]
  • Random domain 2-gram mean distribution [figure: mtp_invalid.png]

1-Grams (example: A - Z)

  • English domain 1-gram mean distribution [figure: mf_valid.png]
  • Random domain 1-gram mean distribution [figure: mf_invalid.png]
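For readers who want to reproduce plots like these, a short matplotlib sketch might look like the following (english_domains and random_domains are assumed word lists; p1 and p2 come from the earlier training snippet):

```python
import matplotlib.pyplot as plt

# Compute the mean 2-gram probability (mtp) for each domain in both sets.
mtp_english = [ngram_means(d, p1, p2)[1] for d in english_domains]
mtp_random  = [ngram_means(d, p1, p2)[1] for d in random_domains]

plt.hist(mtp_english, bins=50, alpha=0.6, label="English domains")
plt.hist(mtp_random,  bins=50, alpha=0.6, label="Random domains")
plt.xlabel("mean 2-gram probability (mtp)")
plt.ylabel("occurrences in training data")
plt.legend()
plt.show()
```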

By looking at these histograms, we can see two immediately useful things:

 

  1. Both appear to follow a normal distribution.
  2. Overlap between English domains and random domains does exist, but it is minimal, so by using both inputs we should be able to classify correctly a large percentage of the time.

The algorithm

The idea is simple: when given a domain, do the following:

 

  1. Parse out the second-level domain. (It is important to use only the second-level domain because there are legitimate uses of random subdomains, such as Akamai and AWS, and we do not want these flagged as malicious.)
  2. Calculate the mean 1-gram and 2-gram probabilities for the second-level domain.
  3. Map those means to a probability that the domain is random. (The following sections define how to do this; a sketch of the full pipeline appears below.)
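Sketched in Python, the pipeline could look like this. tldextract is one real library for step 1; mapping_value is the mapping function derived in the sections that follow:

```python
import tldextract  # handles public-suffix rules like .co.uk

def score_domain(fqdn, p1, p2):
    """Score a domain: values near 1 look English, values near 0 look random."""
    sld = tldextract.extract(fqdn).domain     # step 1: "facebook" from "www.facebook.com"
    mf, mtp = ngram_means(sld, p1, p2)        # step 2: mean 1-gram / 2-gram probabilities
    return mapping_value(mf, mtp, len(sld))   # step 3: mapping function (derived below)
```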

A function to map means to probability of being random

The simplest solution is to view (mean 1-gram, mean 2-gram) as a point on a 2-D plane and find a separating line for all our training data. This could be done with logistic regression, but it generates more false positives and false negatives than is acceptable, so we need to find a non-linear solution.
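For reference, that linear baseline could be sketched with scikit-learn as follows (training_means and training_labels are assumed to come from the data described earlier):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One (mf, mtp) pair per training domain, labeled 1 for English, 0 for random.
X = np.array(training_means)
y = np.array(training_labels)

clf = LogisticRegression().fit(X, y)      # fits a separating line in the plane
p_english = clf.predict_proba(X)[:, 1]    # probability of the English class
```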

 

Finding the mapping function

We wrote a fuzzy logic module for Mathematica that takes a set of fuzzy rules and translates them into equations. Here are the fuzzy logic rules we used:

 

$\text{very}(mf \neq englishmf) \vee \text{very}(mtp \neq englishmtp) \vee (mf \neq englishmf \wedge mtp \neq englishmtp) \mapsto \text{RANDOM}$

 

$l < 13 \vee \neg \text{RANDOM} \mapsto \text{ENGLISH}$

 

$englishmf$ and $englishmtp$ are parameterized Gaussian functions scaled to peak at 1.0. Equality with them is evaluated by taking the Gaussian's value, and inequality by taking 1 minus that value, so both return a value between 0 and 1. The very() function squares that value, so the input must be higher to pass.
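These predicates are straightforward to sketch in Python (an illustration of the rules above, not our Mathematica module; fuzzy OR is max and fuzzy AND is min):

```python
import math

def gauss(x, mu, a):
    """Parameterized Gaussian membership function, scaled to peak at 1.0 at x == mu."""
    return math.exp(-a * (x - mu) ** 2)

def very(truth):
    """The very() hedge: squaring forces the truth value to be higher to pass."""
    return truth ** 2

def random_truth(mf, mtp, theta):
    """very(mf != englishmf) OR very(mtp != englishmtp)
    OR (mf != englishmf AND mtp != englishmtp)."""
    neq_mf  = 1 - gauss(mf,  theta["mu_mf"],  theta["a_mf"])   # mf != englishmf
    neq_mtp = 1 - gauss(mtp, theta["mu_mtp"], theta["a_mtp"])  # mtp != englishmtp
    return max(very(neq_mf), very(neq_mtp), min(neq_mf, neq_mtp))
```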

 

The fuzzy logic module then takes these rules and generates a large parameterized equation with parameters $\Theta$ for the distributions that represent $englishmf$ and $englishmtp$. We can then use whatever method we want to find the best $\Theta$ such that RANDOM values map to 0 and ENGLISH values map to 1 as often as possible. After fitting with a squared-error cost function, we ended up with the equation in Figure 2, which, ignoring the $l$ dimension, can be rendered as in Figure 1.
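Any standard optimizer will do for the fit. For example, a minimal SciPy sketch (a stand-in for the Mathematica fit we actually ran; the length rule is omitted for brevity, and points/labels are the assumed training pairs and 0/1 targets):

```python
import numpy as np
from scipy.optimize import minimize

def cost(params):
    """Squared error: English points should score 1, random points 0."""
    theta = dict(zip(["mu_mf", "a_mf", "mu_mtp", "a_mtp"], params))
    preds = np.array([1 - random_truth(mf, mtp, theta) for mf, mtp in points])
    return float(np.sum((preds - labels) ** 2))

theta0 = np.array([0.07, 1000.0, 0.11, 400.0])   # rough starting guess
best = minimize(cost, theta0, method="Nelder-Mead")
```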

 

Figure 1 (Graph of final mapping function)

[figure: mappingfunction.png]

 

 

Figure 2 (Final mapping function)

 

$\max\big[\; 0.5 + 0.5\operatorname{erf}(14142.1 - 707.107\,l), \\
\quad 1 - \max\big[ \big(e^{-1371.59\,(mf - 0.0759477)^2} - 1\big)^2, \\
\quad\quad \big(e^{-445.707\,(mtp - 0.114675)^2} - 1\big)^2, \\
\quad\quad \min\big[ 1 - e^{-1371.59\,(mf - 0.0759477)^2},\; 1 - e^{-445.707\,(mtp - 0.114675)^2} \big] \big] \big]$

where $\operatorname{erf}$ is the error function, $l$ is the length of the second-level domain, $mf$ is the mean 1-gram probability, and $mtp$ is the mean 2-gram probability.
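For clarity, here is the same function transcribed directly into Python (constants copied from Figure 2; the literal 2.71828 is written as math.exp):

```python
import math

def mapping_value(mf, mtp, l):
    """Figure 2, transcribed: ~1 for English-looking domains, ~0 for random ones."""
    g_mf  = math.exp(-1371.59 * (mf  - 0.0759477) ** 2)
    g_mtp = math.exp(-445.707 * (mtp - 0.114675) ** 2)
    length_term = 0.5 + 0.5 * math.erf(14142.1 - 707.107 * l)  # ~1 for short domains
    return max(length_term,
               1 - max((g_mf - 1) ** 2,
                       (g_mtp - 1) ** 2,
                       min(1 - g_mf, 1 - g_mtp)))
```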

 

Performance

After training, we created a testing set of 5000 random domains and 5000 English domains and ran the algorithm against them. The results are below, with blue dots for the English domains and red dots for the random ones.

[figure: mappingresults.png]

 

A total of 114 random domains scored > 0.5 and only 8 English domains scored < 0.5, giving a 1.22% error rate and only a 0.16% false positive rate.
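The bookkeeping behind those numbers is simple. A sketch, assuming the two test lists and the score_domain pipeline from earlier:

```python
# english_test / random_test are assumed lists of 5000 domains each.
english_scores = [score_domain(d, p1, p2) for d in english_test]
random_scores  = [score_domain(d, p1, p2) for d in random_test]

fp = sum(s < 0.5 for s in english_scores)   # English domains flagged as random
fn = sum(s > 0.5 for s in random_scores)    # random domains that slip through
total = len(english_scores) + len(random_scores)
print(f"error rate: {(fp + fn) / total:.2%}")
print(f"false positive rate: {fp / len(english_scores):.2%}")
```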

 

Conclusion

Statistical analysis of domain names can be a very useful way to find C2 traffic for many malware families currently in the wild, including unknown malware, and there are many ways to do it. It is easy to get around (just smash English words together in your domain generation algorithm) and is certainly not a silver bullet, but it can be useful. We used it to label DNS data for further research on finding infected hosts with Markov models, which we presented at Virus Bulletin 2013.

 

Much research has been done in this area. Damballa, for example, has done work on attributing a particular domain name to a specific malware family.

Comments
Knapovsky_hp | ‎12-20-2013 06:02 AM

Did you also test with some non-English languages, for example German or some East European languages? What about Polish, Czech, or Slovak? Will it still work?

 

Anyhow, I like this a lot.

 

THX

./Miro


31douglas | ‎02-05-2014 02:48 AM

Lovely post, and it correlates exactly with the trends we see of domain names containing English words smashed together instead of gibberish.

Brandon Niemczyk(anon) | ‎02-05-2014 12:00 PM

@Knapovsky_hp

 

You would need to re-build your distributions to use it on languages that are not "close enough" to English. On the other hand, we had several Spanish domains in our data and it did not create false positives.
