Question: Using RDDs, write code to answer the following questions (Q1-Q5) using the given .csv files.
Q1: For the time range between 2017-03-22 22:00 and 2017-03-22 23:00, find the 5 most
used servers. Give the results in descending order of server usage.
Tips: For this query you will need to filter out the records that have null values so that they
are not taken into account in the calculation. Also, you will need to process the date with an
appropriate Python library.
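
One possible way to approach Q1 is sketched below. The question does not name the CSV file or its layout, so the column positions (TS_COL, SERVER_COL) and the timestamp format (TS_FORMAT) are assumptions that must be adjusted to the real data. The window is treated as inclusive at both ends, which is also an assumption. The function takes an existing SparkContext as sc.

```python
import csv
from datetime import datetime

# ASSUMPTIONS (not given in the question): adjust to the real CSV layout.
TS_COL = 0                        # hypothetical index of the timestamp column
SERVER_COL = 1                    # hypothetical index of the server column
TS_FORMAT = "%Y-%m-%d %H:%M:%S"   # hypothetical timestamp layout

START = datetime(2017, 3, 22, 22, 0)
END = datetime(2017, 3, 22, 23, 0)


def parse_record(line):
    """Return (server, 1) for a valid record inside the window, else None."""
    try:
        fields = next(csv.reader([line]))
        ts = datetime.strptime(fields[TS_COL].strip(), TS_FORMAT)
        server = fields[SERVER_COL].strip()
    except (ValueError, IndexError, StopIteration, csv.Error):
        return None  # header row, malformed line, or unparseable date
    if not server or server.lower() == "null":
        return None  # records with a null/empty server are ignored
    if START <= ts <= END:  # inclusive window (assumption)
        return (server, 1)
    return None


def top_servers(sc, path, n=5):
    """Count records per server in the window and return the n largest.

    sc is an existing SparkContext; path points at the CSV file.
    """
    return (sc.textFile(path)
              .map(parse_record)
              .filter(lambda kv: kv is not None)
              .reduceByKey(lambda a, b: a + b)
              .takeOrdered(n, key=lambda kv: -kv[1]))  # descending by count
```

With a context obtained from, for example, SparkContext.getOrCreate(), calling top_servers(sc, path) returns a list of (server, count) pairs, largest count first. Note that sc.textFile splits on newlines, so this sketch assumes no CSV field contains an embedded line break.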
Q2: For the target URL "xxx" in the warc.csv file, find the content length of the
metadata as well as the size of the HTML DOM (number of characters).
Tips: For this query you should filter by URL. Remember to restart
the Spark cluster before each measurement to avoid hot caches, or clear the cache
with the command spark.catalog.clearCache().
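
A minimal sketch for Q2 follows, under stated assumptions: the layout of warc.csv is not given, so the column positions (URL_COL, LENGTH_COL, HTML_COL) are hypothetical, and the target URL is passed in as a parameter because the question elides it. The function takes an existing SparkContext as sc.

```python
import csv

# ASSUMPTIONS (not given in the question): adjust to the real warc.csv layout.
URL_COL = 0      # hypothetical index of the target URL column
LENGTH_COL = 1   # hypothetical index of the metadata content-length column
HTML_COL = 2     # hypothetical index of the raw HTML column


def match_url(line, target_url):
    """Return (content_length, html_char_count) if the line matches the URL."""
    # Raise the per-field limit so long HTML fields are not rejected.
    csv.field_size_limit(10**9)
    try:
        fields = next(csv.reader([line]))
        if fields[URL_COL].strip() != target_url:
            return None
        return (int(fields[LENGTH_COL]), len(fields[HTML_COL]))
    except (ValueError, IndexError, StopIteration, csv.Error):
        return None  # header row, malformed line, or null/non-numeric length


def url_stats(sc, path, target_url):
    """Filter warc.csv by URL and collect (content_length, html_size) pairs.

    sc is an existing SparkContext; path points at warc.csv.
    """
    return (sc.textFile(path)
              .map(lambda line: match_url(line, target_url))
              .filter(lambda r: r is not None)
              .collect())
```

If the HTML column contains embedded line breaks, sc.textFile will split those records across lines and this line-by-line parse will miss them; in that case the file needs a multi-line-aware reader before the same filter can be applied.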
