
HP Security Research Blog
Realtime C2 blocking via Random Domain name identification
With credit to Brandon Niemczyk
HP TippingPoint DVLabs
Many pieces of malware use automated domain generation to increase the robustness of their network infrastructure and to avoid being blacklisted. This places a much heavier burden on any party trying to block them via DNS blacklisting.
Why can AV vendors not just reverse the algorithms and sinkhole all the domains that will be generated?
The traditional method of reversing the domain generation algorithm and predetermining the domains to block has two major pitfalls:
 The time it takes to reverse engineer an algorithm is longer than the time it takes to write a new one, leading to an arms race the good guys can never win.
 Domain generation algorithms may include seed information unknown at the time of writing and reversing, such as the fourth most popular YouTube video on the day of generation. So even if the algorithm is fully understood and reversed, there is still no way to know ahead of time what the generated domains will be.
0-day C2 coverage
Using a statistical approach allows domains to be discovered and blocked in real time, without any prior knowledge of the malware, protecting patient zero.
Method Overview
We want to look at the domain name and automatically decide whether it is meant for humans to read or is likely generated by an algorithm. Intuitively, this is easy for a human to do, but it requires a bit more thought to automate. So we will leverage old-school cryptanalysis techniques, but instead of posing the question, "Does this key applied to this text produce valid output?" we will ask, "Is this string likely to be human-readable?"
We need to select some data points to analyze. The simplest solution is to associate with every character (called a 1-gram) and every character pair (called a 2-gram) a probability of showing up in English. Then, by analyzing the probability of a given string being generated, we can decide whether it is likely to be English. The upside is that the probabilities are easy to generate and require no understanding of the semantics of the target language. In fact, given a string, we simply take all the probabilities for its 1-grams and 2-grams and compute the mean of each type. These two numbers can then be used to determine how likely the string is to belong to our target language.
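As a minimal sketch of this step, the tables and means could be computed as follows. The function names `ngram_tables` and `mean_ngram_probs` are illustrative, not from the original tooling, and a production system would train the tables on a large English dataset rather than a toy corpus:

```python
from collections import Counter

def ngram_tables(corpus):
    """Build 1-gram and 2-gram probability tables from a training corpus."""
    # Keep letters only; digits and punctuation will get probability 0 later.
    # For simplicity this also joins words, so some 2-grams span word boundaries.
    text = "".join(c for c in corpus.lower() if c.isalpha())
    c1 = Counter(text)
    c2 = Counter(text[i:i + 2] for i in range(len(text) - 1))
    t1, t2 = sum(c1.values()), sum(c2.values())
    p1 = {g: n / t1 for g, n in c1.items()}
    p2 = {g: n / t2 for g, n in c2.items()}
    return p1, p2

def mean_ngram_probs(domain, p1, p2):
    """Mean 1-gram and 2-gram probabilities for a domain string.
    N-grams never seen in training contribute probability 0."""
    s = domain.lower()
    mf = sum(p1.get(c, 0.0) for c in s) / len(s)
    mtp = (sum(p2.get(s[i:i + 2], 0.0) for i in range(len(s) - 1))
           / max(len(s) - 1, 1))
    return mf, mtp
```

With a reasonable English training corpus, an English-like label such as "facebook" should score a higher 1-gram mean than a keyboard-mash string.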
To understand why these are useful data points, let's look at histograms of the means for datasets of both English and randomly generated domains.
In the following histograms, the x-axis is the mean probability of occurrence in English and the y-axis is the number of occurrences in the training data. As an illustration, a random second-level domain might be "jfkdslvjiewnfi0e2jlfjksl" while an English second-level domain might be "facebook".
2-grams (example: AA - ZZ)
 English domain 2-gram mean distribution
 Random domain 2-gram mean distribution
1-grams (example: A - Z)
 English domain 1-gram mean distribution
 Random domain 1-gram mean distribution
By looking at these histograms, we can see two immediately useful things:
 Both appear to follow a normal distribution.
 Overlap between English domains and random domains does exist, but it is minimal. By using both inputs we should be able to classify correctly a large percentage of the time.
The algorithm
So the idea is simple. When given a domain, do the following:
 Parse out the second-level domain. (It is important to use only the second-level domain because there are legitimate uses of random subdomains (Akamai, AWS, etc.), and we do not want these flagged as malicious.)
 Calculate the mean 1-gram and 2-gram probabilities for the second-level domain.
 Map those means to a probability that the domain is random. (The following section defines how to do this.)
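The first step can be sketched with a naive dot-split. This is a deliberate simplification: correctly finding the second-level label for multi-part suffixes such as .co.uk requires consulting the Public Suffix List, which the sketch below does not do. The function name is ours:

```python
def second_level_domain(hostname):
    """Return the second-level label: 'a8f3k2.cdn.example.com' -> 'example'.

    Naive approach: split on dots and take the label left of the TLD.
    A real implementation would use the Public Suffix List so that
    multi-part suffixes like .co.uk are handled correctly.
    """
    labels = hostname.lower().rstrip(".").split(".")
    return labels[-2] if len(labels) >= 2 else labels[0]
```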
A function to map means to probability of being random
The simplest solution is to treat (mean 1-gram, mean 2-gram) as a point on a 2D plane and find a line separating all our training data. This could be done with logistic regression, but it generates more false positives and false negatives than is acceptable, so we need a nonlinear solution.
Finding the mapping function
We wrote a fuzzy logic module for Mathematica that takes a set of fuzzy rules and translates them into equations. Here are the fuzzy logic rules we used:
$\text{very}(mf \neq \mathit{englishmf}) \vee \text{very}(mtp \neq \mathit{englishmtp}) \vee (mf \neq \mathit{englishmf} \wedge mtp \neq \mathit{englishmtp}) \mapsto \text{RANDOM}$
$l < 13 \vee \neg\,\text{RANDOM} \mapsto \text{ENGLISH}$
$\mathit{englishmf}$ and $\mathit{englishmtp}$ are parameterized Gaussian functions scaled to peak at 1.0. Equality with them is evaluated by taking their value; inequality is 1 minus that value. Equality and inequality therefore return a value between 0 and 1, and the very() function squares that value, so it must be higher to pass.
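These fuzzy primitives can be sketched directly from the description above. The function names are hypothetical, and the center and width parameters stand in for the fitted $\Theta$:

```python
import math

def gaussian_eq(x, center, width):
    """Fuzzy 'equality': a Gaussian membership function that peaks at 1.0
    when x equals the center."""
    return math.exp(-width * (x - center) ** 2)

def fuzzy_neq(x, center, width):
    """Fuzzy inequality: 1 minus the membership value."""
    return 1.0 - gaussian_eq(x, center, width)

def very(v):
    """The very() hedge squares the truth value, so a value must be
    closer to 1 to pass."""
    return v * v
```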
The fuzzy logic module then takes these rules and generates a large parameterized equation with parameters $\Theta$ for the distributions that represent $\mathit{englishmf}$ and $\mathit{englishmtp}$. We can then use whatever method we want to find the best $\Theta$ such that RANDOM values map to 0 and ENGLISH values map to 1 as often as possible. After fitting with a squared error cost function, we ended up with the equation in Figure 2, which, ignoring the $l$ dimension, can be rendered as in Figure 1.
Figure 1 (Graph of final mapping function)
Figure 2 (Final mapping function)
$ \max\big[ \\ \text{ } 0.5 + 0.5\,\mathrm{erf}(14142.1 - 707.107\,l), \\ \text{ } 1 - \max\big[ \text{ } (1 - e^{-1371.59\,(mf - 0.0759477)^2})^2, \\ \text{ } (1 - e^{-445.707\,(mtp - 0.114675)^2})^2, \\ \text{ } \min\big[1 - e^{-1371.59\,(mf - 0.0759477)^2},\; 1 - e^{-445.707\,(mtp - 0.114675)^2}\big]\big]\big] \\ $
where erf is the error function, $l$ is the length of the second-level domain, $mf$ is the mean 1-gram probability, and $mtp$ is the mean 2-gram probability.
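The fitted mapping can be sketched in Python, reading each Gaussian term as $e^{-w(x-c)^2}$ so that it peaks at 1.0 as described above. The sign placement is our reading of the rendered formula, and the function name is ours; the numeric constants are the fitted parameters from Figure 2:

```python
import math

# Fitted parameters as they appear in Figure 2.
MF_CENTER, MF_WIDTH = 0.0759477, 1371.59
MTP_CENTER, MTP_WIDTH = 0.114675, 445.707

def p_english(l, mf, mtp):
    """Map (length, mean 1-gram prob, mean 2-gram prob) to a score near 1
    for English-like domains and near 0 for random-looking ones."""
    g_mf = math.exp(-MF_WIDTH * (mf - MF_CENTER) ** 2)
    g_mtp = math.exp(-MTP_WIDTH * (mtp - MTP_CENTER) ** 2)
    # RANDOM rule: very(mf != english) OR very(mtp != english)
    # OR (mf != english AND mtp != english), with AND as min, OR as max.
    random_score = max((1 - g_mf) ** 2,
                       (1 - g_mtp) ** 2,
                       min(1 - g_mf, 1 - g_mtp))
    # ENGLISH rule: short-length term OR NOT RANDOM.
    length_term = 0.5 + 0.5 * math.erf(14142.1 - 707.107 * l)
    return max(length_term, 1 - random_score)
```

Note that the erf term acts as a sharp length cutoff: short second-level domains default to the ENGLISH side, matching the $l < 13$ rule (with the threshold refitted during training).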
Performance
After training, we created a testing set of 5000 random domains and 5000 English domains and ran the algorithm against them. The results are below, with blue dots for the English domains and red dots for the random ones.
A total of 114 random domains have a value > 0.5 and only 8 English domains have a value < 0.5, giving a 1.22% overall error rate and only a 0.16% false positive rate.
Conclusion
Statistical analysis of domain names can be a very useful way to find C2 traffic for many malware families currently in the wild, including unknown malware, and there are many ways to do it. It is easy to get around (just smash English words together in your domain generation algorithm) and is certainly not a silver bullet, but it can be useful. We used it to label DNS data for further research on finding infected hosts using Markov models, which we presented at Virus Bulletin 2013.
Much research has been done in this area; Damballa, for example, has done work on attributing a particular domain name to a specific malware family.
Did you also test with some non-English languages, for example German or some East European language? What about Polish, Czech, or Slovak? Will it still work?
Anyhow, I like this a lot.
THX
./Miro
I am sure this information will help a large number of people focus on some of the important aspects of the topic.
@Knapovsky_hp
You would need to rebuild your distributions to use it on languages that are not "close enough" to English. On the other hand, we had several Spanish domains in our data and it did not create false positives.
