---
title: "Building a DGA Classifier: Data Prep"
output: html_document
---
This will be a three-part blog series on building a DGA classifier, split into the three phases of building a classifier: 1) data preparation, 2) feature engineering and 3) model selection. Before I get too far into this, I want to give a huge thank you to Click Security for releasing a [DGA classifier](https://github.com/ClickSecurity/data_hacking/tree/master/dga_detection) in python as part of their very nice [Data Hacking github repo](https://github.com/ClickSecurity/data_hacking). If you'd like to follow along in python, I won't be deviating much from their original code. Most of what I did was recreate their work in R, and I'll only deviate from their work to experiment or to explain some steps and the thought process. The code, data and markdown scripts for this series are [on github](https://github.com/jayjacobs/dga-tutorial).
### A little background on DGA
DGA stands for Domain Generating Algorithm, and these algorithms are part of the evolution of malware communications. In the beginning, malware would be hardcoded with IP address(es) or domain names, and the botnet could be disrupted by going after whatever was hardcoded. The purpose of a DGA is to deterministically generate a whole lot of domain names, of which the bot maintainer only has to register one or more to enable the bots to find a controlling server. If the domain or IP is taken down, a new name from the algorithm can be used by the botnet maintainer and the botnet maintained. However, if we could build a classifier to separate domains generated by an algorithm from legitimate domains, we could better identify malicious activity in DNS or proxy logs.
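To make that idea concrete, here is a minimal sketch of what a seeded, deterministic generator might look like in R. This is purely illustrative and is not the algorithm used by cryptolocker or any other real malware family:
```{r eval=FALSE}
# A toy, illustrative DGA: deterministically derive pseudo-random
# domain names from a seed (e.g. derived from the current date).
# Not the algorithm of any real malware family.
toyDGA <- function(seed, n=10, len=12, tld=".ru") {
  set.seed(seed)
  replicate(n, paste0(paste(sample(letters, len, replace=TRUE), collapse=""), tld))
}
toyDGA(seed=20141107, n=5)
```
Because the generator is deterministic, the bot and its maintainer can independently compute the same candidate list for a given seed; the maintainer only needs to register one of them.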
### And where we are headed
I wanted the finished package to be very straightforward to use and not require any experience in machine learning. I don't think it gets any easier than this:
Install the package:
```{r message=FALSE, warning=FALSE, eval=FALSE}
devtools::install_github("jayjacobs/dga")
```
```{r message=FALSE, warning=FALSE}
library(dga)
# known good domains:
good <- c("facebook.com", "google.com", "sina.com.cn",
"twitter.com", "yandex.ru", "msn.com")
# DGA domains generated by cryptolocker
bad <- c("kqcrotywqigo.ru", "rlvukicfjceajm.ru", "ibxaoddvcped.ru",
"tntuqxxbvxytpif.ru", "heksblnvanyeug.ru", "keaeodsrfafqpdp.ru")
# classify the domains as either "legit" or "dga"
dgaPredict(c(good, bad))
```
The function returns the domain name it extracted from each input, the class it assigned ("legit" or "dga") and the probability behind that classification. You can see from the output that all of these domains were clearly classified (with high probability).
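If you want to work with the result rather than just print it, the return value can be handled like a data frame. The column names below (`name`, `class`, `prob`) are my shorthand for the three values described above and may differ slightly in the installed version, so check `names(res)` first:
```{r eval=FALSE}
# Keep only the domains flagged as algorithmically generated
# (column name "class" is assumed; verify with names(res)).
res <- dgaPredict(c(good, bad))
subset(res, class == "dga")
```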
So that's what these posts are going to walk through: what steps did we go through to answer the relatively simple question of "_Is this domain legitimate or generated by an algorithm?_"
# Step 1: Get the data
The first major step in building any classifier is getting training data. If that term is new to you, think of training data as an answer key to a test. We want a list of the questions (domain/host names) and the associated answers (whether each is "legit" or "dga"). In some cases, establishing reliable training data is a huge challenge, but in this case we're lucky. All we need is a list of good/legitimate domains and a second list of domains generated by an algorithm. In the [example from click security](https://github.com/ClickSecurity/data_hacking/tree/master/dga_detection/data), they offer several data sets that we could copy, but we'll seek out our own lists.
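In other words, the end goal of this post is a single labeled data set along these lines (the column names here are just illustrative, not necessarily the ones used later in the series):
```{r eval=FALSE}
# Illustrative shape of the training data we are assembling:
# one row per domain, one column for the label.
head(data.frame(domain = c("google.com", "kqcrotywqigo.ru"),
                class  = c("legit", "dga"),
                stringsAsFactors = FALSE))
```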
### Alexa
An obvious choice is the Alexa list of top web sites, but it's not really suitable for our use as is. If you grab the [top 1 Million Alexa domains](http://s3.amazonaws.com/alexa-static/top-1m.csv.zip) and parse it, you'll find just over 11 thousand entries are full URLs and not just domains, and there are thousands of entries with subdomains that don't help us (we are only classifying domains here). So after I remove the URLs and de-duplicate the domains, I end up with the Alexa top 972,544. For now, I save off the whole thing ([alexa-972k.rda]).
```{r eval=FALSE, echo=FALSE}
library(data.table)
alexa <- fread("data//top-1m.csv", header=F)[[2]]
sum(duplicated(alexa)) # zero
library(tldextract)
system.time(alexa.clean <- tldextract(alexa)) # 32 seconds for me
summary(factor(alexa.clean$domain)) # quite a few duplicated
alexa.ok <- !is.na(alexa.clean$domain)
alexa <- unique(paste(alexa.clean$domain[alexa.ok],
alexa.clean$tld[alexa.ok], sep="."))
write.csv(alexa.uniq, file="data/alexa-top-972k.csv", row.names=F)
save(alexa, file="data/alexa-972k.rda", compress=T)