Question: Currently the script associated with Example 5 in Week 5 shows a list of the unique IP addresses present in the log. However, such a list may not be sufficient without a count of the number of unexpected accesses to a particular resource. For this assignment you are tasked with modifying the script from Week 5 (which is shown below) to show the following:

A list of IP addresses in ascending order,

How many times an IP address is reported in the log, and

Whether an IP address is in a list of known IP addresses.
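The three requirements above can be sketched in one small function. This is only a minimal sketch, not the expected solution: `summarize_ips` and its `known_ips` parameter are hypothetical names, and the pattern is the same one the Week 5 script uses. It assumes "ascending order" means numeric order of the octets, since a plain string sort would place 10.50.100.10 before 10.50.100.9.

```python
import re
from collections import Counter

# Same IPv4 pattern as the Week 5 script (source: https://ihateregex.io/expr/ip/)
IP_PATTERN = re.compile(
    r'(\b25[0-5]|\b2[0-4][0-9]|\b[01]?[0-9][0-9]?)'
    r'(\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){3}'
)


def summarize_ips(lines, known_ips):
    """Return (ip, count, is_known) tuples sorted by numeric IP value.

    `lines` is any iterable of log lines; `known_ips` is a set of strings.
    Both names are illustrative and not part of the original script.
    """
    counts = Counter()
    for line in lines:
        # finditer yields every match on the line, not just the first one
        for match in IP_PATTERN.finditer(line):
            counts[match.group()] += 1

    # Compare octets as integers so that .9 sorts before .10
    def numeric_key(ip):
        return tuple(int(part) for part in ip.split('.'))

    return [(ip, counts[ip], ip in known_ips)
            for ip in sorted(counts, key=numeric_key)]
```

A `Counter` replaces the `if ip_addr not in ip_addresses_in_log` check from the original script: its keys are the distinct addresses and its values are the occurrence counts.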

Script Example From Week 5 Below

import os
from pathlib import Path
import re

# Getting the directory that contains the script, so all file operations will take place in that directory
script_home_dir = os.path.dirname(os.path.abspath(__file__))
sample_file = 'HDFS_2k.log'

# This list will host the content of the file
file_content = []

# Reading the file
with open(Path(script_home_dir, sample_file), 'r') as my_file:
    file_content = my_file.readlines()

# Trying to match only valid IP addresses (0-255) - Source: https://ihateregex.io/expr/ip/
digits_pattern = re.compile(r'(\b25[0-5]|\b2[0-4][0-9]|\b[01]?[0-9][0-9]?)(\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){3}')
print(f" Search pattern: {digits_pattern}")

ip_addresses_in_log = []

# Extracting multiple IP addresses from each line
# Working on each line to match the pattern, using a different method that can get all matches
for i, line in enumerate(file_content):
    line = line.strip()
    search_outcome = digits_pattern.finditer(line)  # This is the new way of attempting to find matches within the line
    print(f'Line {i}: {line}')

    # Note: finditer always returns an iterator (possibly empty), so this check is always true
    if search_outcome is not None:
        # print(search_outcome)
        for n, ip in enumerate(search_outcome):
            ip_addr = ip.group()
            print(f'--- Match {n}: {ip_addr}')
            if ip_addr not in ip_addresses_in_log:
                ip_addresses_in_log.append(ip_addr)

    # Limiting the loop to the first few lines - Remove the if block below to run through the entire log file
    if i > 100:
        break

# ----------------------------------------
print(" Distinct IP addresses in the log:")
for ip_addr in ip_addresses_in_log:
    print(f'IP: {ip_addr}')

An example of the output is the following:

IP Address       Count    Expected
10.50.100.150    72       Yes
10.50.100.152    100      Yes
10.50.100.155    46       No

Please note that the table above is just a depiction and I am not expecting an actual table.
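Since an actual table is not required, aligned plain-text lines are enough. The sketch below shows one possible way to print them; `format_report` is a hypothetical helper, and it assumes the results are already available as `(ip, count, is_known)` tuples. The column widths are arbitrary choices.

```python
def format_report(rows):
    """Format (ip, count, is_known) tuples as aligned plain-text lines.

    `rows` is assumed to be already sorted; this helper only handles layout.
    """
    # Header row: left-aligned address, right-aligned count
    lines = [f"{'IP Address':<18}{'Count':>7}  Expected"]
    for ip, count, is_known in rows:
        expected = 'Yes' if is_known else 'No'
        lines.append(f"{ip:<18}{count:>7}  {expected}")
    return lines


if __name__ == '__main__':
    # Sample values taken from the depiction above
    sample = [('10.50.100.150', 72, True), ('10.50.100.155', 46, False)]
    for text in format_report(sample):
        print(text)
```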

The source of the analysis should be the HDFS_2k.log file, which was used as Data File 2 for Week 5 and is attached to this assignment. The list of known IP addresses is also attached to this assignment.
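The format of the known-IP attachment is not stated here. As a sketch, assuming it is a plain-text file with one address per line, it could be loaded into a set for fast membership checks. The function name `load_known_ips` and the file name `known_ips.txt` in the usage comment are hypothetical, and the handling of `#` comment lines is an extra assumption.

```python
from pathlib import Path


def load_known_ips(path):
    """Read one IP address per line into a set.

    Assumes a plain-text file; blank lines and lines starting with '#'
    are skipped. Adjust this if the attached list uses another format.
    """
    known = set()
    for raw in Path(path).read_text().splitlines():
        entry = raw.strip()
        if entry and not entry.startswith('#'):
            known.add(entry)
    return known


# Usage (hypothetical file name, placed next to the script):
# known_ips = load_known_ips(Path(script_home_dir, 'known_ips.txt'))
```

A set is used rather than a list because the "Expected" check is a membership test that runs once per distinct address.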

Notes

Your scripts should be fully commented, including the author, the purpose of the script, and the date.
