Question: Consider 3 files, warc.csv, wat.csv and wet.csv, without headers.

Consider the data in the warc.csv file. An indicative line is:

2017-03-22T22:13:57Z, urn, response, 40802, 213.155.18.48, http:coco, Apache, html

Columns in order: first the warc date, the warc record id, the warc type (e.g. metadata, response, etc.), the content length, the public IP address, the target URL, the server running the site (e.g. apache, nginx, etc.), and finally the overall content of the page with the entire HTML DOM.
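As an illustration of the column layout above, here is a minimal sketch of parsing one header-less warc.csv line in plain Python. The names `WarcRecord` and `parse_warc_line` are hypothetical, and the sketch assumes standard CSV quoting (the HTML column wrapped in double quotes, with inner quotes doubled), which the question does not confirm.

```python
import csv
from typing import NamedTuple, Optional


class WarcRecord(NamedTuple):
    # Field order follows the column order given in the question.
    date: str
    record_id: str
    warc_type: str
    content_length: str
    ip: str
    target_url: str
    server: str
    html: str


def parse_warc_line(line: str) -> Optional[WarcRecord]:
    """Parse one warc.csv line; return None if it does not have 8 columns."""
    try:
        # skipinitialspace drops the blank that follows each comma.
        fields = next(csv.reader([line], skipinitialspace=True))
    except (csv.Error, StopIteration):
        return None
    if len(fields) != 8:
        return None
    return WarcRecord(*(f.strip() for f in fields))


# Assumed sample row: the HTML column is quoted because it contains quotes.
sample = ('2017-03-22T22:13:57Z, urn, response, 40802, 213.155.18.48, '
          'http:coco, Apache, "<html><a href=""http://a.com"">x</a></html>"')
rec = parse_warc_line(sample)
```

A function like this could be passed to `map` on an RDD of lines. Note that line-based reading splits on newlines, so it would not work if the HTML column contains raw line breaks.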

Consider the data in the wat.csv file. An indicative line is:

urn:uuid, 1053, http:coco

In order, the columns are: first the warc record id, the content length of the metadata, and finally the target URL (it can be different from the target URL of the warc data).

Consider the data in the wet.csv file. An indicative line is:

urn:uuid, "extracted plaintext"

In order, the columns are: first the warc record id and then the extracted plaintext from the URL (can be in ASCII).
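All three files carry the warc record id, so one plausible first step before any join is to key each row by that id. The sketch below does this in plain Python for wat.csv and wet.csv lines. The helper name `key_by_record_id` is hypothetical, and treating empty fields as null values is an assumption.

```python
import csv


def key_by_record_id(line, expected_fields):
    """Return (record_id, remaining fields), or None for malformed or empty-valued rows."""
    try:
        fields = [f.strip() for f in next(csv.reader([line], skipinitialspace=True))]
    except (csv.Error, StopIteration):
        return None
    # Assumption: a row with the wrong column count or an empty field counts as null.
    if len(fields) != expected_fields or any(f == "" for f in fields):
        return None
    return fields[0], tuple(fields[1:])


wat = key_by_record_id("urn:uuid, 1053, http:coco", 3)
wet = key_by_record_id('urn:uuid, "extracted plaintext"', 2)
```

Pairs of this shape are what the RDD `join` operation expects: the first element is the key and the second is the value.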

Using RDDs, write Python code to answer the following.

Task 1

Find the most popular target URL (i.e., the record target URL that can be found in the HTML DOM of another record).

Tips: You will need to join datasets to get the desired result. For this query you will need to filter out the records that have null values. You should first find for each warc record what its target URL is and what URLs are in the HTML DOM, so you get an intermediate result: targetURL -> list(urls in html dom). For the URLs you could simplify them and keep a simpler format/subdomain to get even more results.

Remember to restart the Spark cluster before each measurement, to avoid hot caches, or you can clear the cache.
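The tips above can be sketched on RDDs roughly as follows. This is one possible reading of the task, not a confirmed solution. All function names (`simplify_url`, `extract_links`, `parse_warc`, `most_popular_target`) are hypothetical, and links are pulled from `href` attributes with a regular expression rather than a real HTML parser. The sketch reads only warc.csv, which already holds both the target URL and the HTML. The join is between the link counts and the set of record targets, so whether wat.csv should supply the target URL instead is left open.

```python
import csv
import re
from urllib.parse import urlsplit

HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def simplify_url(url):
    """Reduce a URL to its lower-cased host without a leading 'www.'; None if no host."""
    if not url:
        return None
    url = url.strip()
    if "//" not in url:
        url = "//" + url  # lets urlsplit read 'example.com/x' as host + path
    try:
        host = urlsplit(url).hostname
    except ValueError:
        return None
    if not host:
        return None
    return host[4:] if host.startswith("www.") else host


def extract_links(html):
    """Return the set of simplified URLs found in href attributes of the HTML."""
    hosts = (simplify_url(m) for m in HREF_RE.findall(html or ""))
    return {h for h in hosts if h}


def parse_warc(line):
    """Return (simplified target URL, HTML), or None for malformed or null rows."""
    try:
        f = next(csv.reader([line], skipinitialspace=True))
    except (csv.Error, StopIteration):
        return None
    if len(f) != 8 or not f[5].strip() or not f[7].strip():
        return None
    target = simplify_url(f[5])
    return (target, f[7]) if target else None


def most_popular_target(sc, warc_path):
    """Sketch of Task 1; `sc` is an existing SparkContext supplied by the caller."""
    records = sc.textFile(warc_path).map(parse_warc).filter(lambda r: r is not None)
    # Intermediate result from the tips: targetURL -> set(urls in html dom)
    target_links = records.mapValues(extract_links)
    # One (link, 1) pair per distinct link in a record, skipping self-links
    link_counts = (target_links
                   .flatMap(lambda kv: [(link, 1) for link in kv[1] if link != kv[0]])
                   .reduceByKey(lambda a, b: a + b))
    # Join with the known targets so only URLs that are record targets survive
    targets = target_links.keys().distinct().map(lambda t: (t, None))
    popular = link_counts.join(targets).mapValues(lambda v: v[0])
    return popular.takeOrdered(1, key=lambda kv: -kv[1])
```

Reducing URLs to a host is only one way to "simplify" them. It increases the number of matches at the cost of merging distinct pages on the same site.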

Task 2

Perform Task 1 using DataFrames/Spark SQL and parquet files.
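A rough sketch of how Task 2 might look, under several assumptions: the column names in `WARC_COLUMNS` are invented for illustration, the caller supplies an existing `SparkSession` as `spark`, and the Spark version is 3.1 or later (needed for `regexp_extract_all`). The query matches full URLs as written in `href="..."` attributes and does not simplify them.

```python
# Hypothetical column names, in the order given for warc.csv in the question.
WARC_COLUMNS = ["warc_date", "record_id", "warc_type", "content_length",
                "ip", "target_url", "server", "html"]


def csv_to_parquet(spark, csv_path, parquet_path):
    """Read the header-less CSV, name its columns, and store it as parquet."""
    df = spark.read.csv(
        csv_path,
        header=False,
        ignoreLeadingWhiteSpace=True,
        multiLine=True,   # assumption: the HTML column may span several lines
        escape='"',       # assumption: inner quotes are doubled
    ).toDF(*WARC_COLUMNS)
    df.write.mode("overwrite").parquet(parquet_path)


TASK1_SQL = """
WITH links AS (
  SELECT target_url AS source,
         explode(array_distinct(
           regexp_extract_all(html, 'href="([^"]+)"', 1))) AS link
  FROM warc
  WHERE target_url IS NOT NULL AND html IS NOT NULL
)
SELECT l.link AS target_url, COUNT(*) AS mentions
FROM links l
JOIN (SELECT DISTINCT target_url FROM warc
      WHERE target_url IS NOT NULL) t
  ON l.link = t.target_url
WHERE l.link <> l.source
GROUP BY l.link
ORDER BY mentions DESC
LIMIT 1
"""


def most_popular_target_sql(spark, parquet_path):
    """Register the parquet data as a view and run the Task 1 query on it."""
    spark.read.parquet(parquet_path).createOrReplaceTempView("warc")
    return spark.sql(TASK1_SQL).collect()
```

To compare timings with the RDD version without restarting the cluster, `spark.catalog.clearCache()` drops cached tables between runs.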
