Question: This assignment is a step - by - step guide on how to detect domains that were generated using Domain Generation Algorithm ( DGA )

This assignment is a step-by-step guide on how to detect domains that
were generated using "Domain Generation Algorithm" (DGA).
Overview 2 main steps:
Feature Engineering - from raw domain strings to numeric Machine
Learning features using DataFrame manipulations
Machine Learning Classification - predict whether a domain is le-
git or not using a Decision Tree Classifier
DGA - Background
"Various families of malware use domain generation algorithms (DGAs)
to generate a large number of pseudo-random domain names to con-
nect to a command and control (C2) server. In order to block DGA C2
traffic, security organizations must first discover the algorithm by re-
verse engineering malware samples, then generate a list of domains for
a given seed. The domains are then either preregistered, sink-holed or
published in a DNS blacklist. This process is not only tedious, but can
be readily circumvented by malware authors. An alternative approach
to stop malware from using DGAs is to intercept DNS queries on a net-
work and predict whether domains are DGA generated. Much of the
previous work in DGA detection is based on finding groupings of like
domains and using their statistical properties to determine if they are
DGA generated. However, these techniques are run over large time
windows and cannot be used for real-time detection and prevention. In
addition, many of these techniques also use contextual information
such as passive DNS and aggregations of all NXDomains throughout a
network. Such requirements are not only costly to integrate, they may
not be possible due to real-world constraints of many systems (such as
endpoint detection). An alternative to these systems is a much harder
problem: detect DGA generation on a per domain basis with no infor-
mation except for the domain name. Previous work to solve this harder
problem exhibits poor performance and many of these systems rely
heavily on manual creation of features; a time consuming process that
can easily be circumvented by malware authors..."
[Citation: Woodbridge et. al 2016: "Predicting Domain Generation Algo-
rithms with Long Short-Term Memory Networks"]
For this exercise, you will use the attached dataset:
'dga_data_small.csv".
The goals are:
Develop features to be used in the model.
Develop a supervised model and evaluate its performance.
Deliverables:
1. A brief explanation of what algorithm you used and how it per-
formed.
2. An explanation what features you extracted and how you arrived at
them.
3. The source code for your final model.
This assignment is a step - by - step guide on

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer
Step: 1 Unlock blur-text-image
Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock
Step: 3 Unlock

Students Have Also Explored These Related Programming Questions!