Question: This assignment is a step - by - step guide on how to detect domains that were generated using Domain Generation Algorithm ( DGA )
This assignment is a stepbystep guide on how to detect domains that
were generated using "Domain Generation Algorithm" DGA
Overview main steps:
Feature Engineering from raw domain strings to numeric Machine
Learning features using DataFrame manipulations
Machine Learning Classification predict whether a domain is le
git or not using a Decision Tree Classifier
DGA Background
"Various families of malware use domain generation algorithms DGAs
to generate a large number of pseudorandom domain names to con
nect to a command and control C server. In order to block DGA C
traffic, security organizations must first discover the algorithm by re
verse engineering malware samples, then generate a list of domains for
a given seed. The domains are then either preregistered, sinkholed or
published in a DNS blacklist. This process is not only tedious, but can
be readily circumvented by malware authors. An alternative approach
to stop malware from using DGAs is to intercept DNS queries on a net
work and predict whether domains are DGA generated. Much of the
previous work in DGA detection is based on finding groupings of like
domains and using their statistical properties to determine if they are
DGA generated. However, these techniques are run over large time
windows and cannot be used for realtime detection and prevention. In
addition, many of these techniques also use contextual information
such as passive DNS and aggregations of all NXDomains throughout a
network. Such requirements are not only costly to integrate, they may
not be possible due to realworld constraints of many systems such as
endpoint detection An alternative to these systems is a much harder
problem: detect DGA generation on a per domain basis with no infor
mation except for the domain name. Previous work to solve this harder
problem exhibits poor performance and many of these systems rely
heavily on manual creation of features; a time consuming process that
can easily be circumvented by malware authors..."
Citation: Woodbridge et al : "Predicting Domain Generation Algo
rithms with Long ShortTerm Memory Networks"
For this exercise, you will use the attached dataset:
'dgadatasmall.csv
The goals are:
Develop features to be used in the model.
Develop a supervised model and evaluate its performance.
Deliverables:
A brief explanation of what algorithm you used and how it per
formed.
An explanation what features you extracted and how you arrived at
them.
The source code for your final model.
Step by Step Solution
There are 3 Steps involved in it
1 Expert Approved Answer
Step: 1 Unlock
Question Has Been Solved by an Expert!
Get step-by-step solutions from verified subject matter experts
Step: 2 Unlock
Step: 3 Unlock
