Question: Spark standalone analytics

To carry out some basic pre-processing steps to prepare a big data environment for machine learning.

To convert native types to Spark types.

Spark

Datasets indicated in the exercises

VMware platform

1- Load two days of data from the retail sales data available on the VMware image under the directory /home/centos/data/retail-data/by-day/ into a dataframe and name it df_

(Use schema inference.)

Please load the data for the ninth and the tenth of January. The data files are stored on the CentOS Linux guest running on the VMware platform.
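A minimal PySpark sketch of the load in point 1, assuming the by-day files are CSVs with a header row and that `spark` is the session the pyspark shell provides. The `.csv` extension, the `header` option, and the helper names `day_paths` and `load_days` are assumptions for illustration; check the directory listing on the image for the real file names.

```python
import posixpath

BASE_DIR = "/home/centos/data/retail-data/by-day/"


def day_paths(base_dir, day_names, extension=".csv"):
    """Build one path per day file. The naming scheme is an assumption."""
    return [posixpath.join(base_dir, name + extension) for name in day_names]


def load_days(spark, paths):
    """Read several CSV files into one DataFrame, inferring column types.

    `spark` is an existing SparkSession (the pyspark shell provides one).
    """
    return (
        spark.read
        .option("header", True)       # assumption: first line holds column names
        .option("inferSchema", True)  # Spark scans the data to guess the types
        .csv(paths)                   # a list of paths is read as one DataFrame
    )


# Usage in the pyspark shell (placeholders, replace with the real file names):
# paths = day_paths(BASE_DIR, ["<file-for-9-January>", "<file-for-10-January>"])
# df_ = load_days(spark, paths)
```

Passing both paths in one `csv` call gives a single dataframe, so no explicit union is needed.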

2- Check the Spark UI at localhost port 4040 (or whichever port Spark binds to when launched) and record the following in your analysis report:

The time it took to load the data.

The number of tasks, with an explanation of what happened.

Take a screenshot of the DAG execution and add it to your analysis report.
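The load time reported by the UI can be cross-checked with a wall-clock measurement. The sketch below is only a convenience helper (the name `timed` is hypothetical), not a replacement for the figures in the UI. With schema inference enabled, the read itself launches Spark jobs, which is why the load appears in the UI before any action is called.

```python
import time


def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


# Usage in the pyspark shell (`paths` is a hypothetical list of file paths):
# print(spark.sparkContext.uiWebUrl)  # address of the UI for this session
# df_, seconds = timed(
#     lambda: spark.read.option("header", True).option("inferSchema", True).csv(paths)
# )
# print(f"load took {seconds:.2f} s")
```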

For points 3 to 7, use the high-level DataFrame API, and make sure you show the full column content, i.e. no truncation.
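By default `show` shortens long cell values; passing `truncate=False` prints them in full. A small sketch, with hypothetical helper names:

```python
def show_full(df, max_rows=20):
    """Print up to max_rows rows without shortening long cell values."""
    df.show(max_rows, truncate=False)


def show_all(df):
    """Print every row untruncated; only sensible for small results."""
    df.show(df.count(), truncate=False)
```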

3- Carry out some basic investigation: count the number of records and print the inferred schema. Record the results in your analysis report.
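One possible shape for point 3, assuming `df` is the dataframe loaded in point 1. The helper name `basic_investigation` is hypothetical.

```python
def basic_investigation(df):
    """Print the inferred schema and return the record count."""
    df.printSchema()  # tree view of column names and inferred types
    n = df.count()    # an action: runs a job over every partition
    print(f"records: {n}")
    return n


# Usage in the pyspark shell:
# basic_investigation(df_)
```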

4- Show all the transactions related to the purchase of a stock ID that starts with "227", with the product type "ALARM CLOCK" mentioned as part of the description or a unit price greater than 5.

5- Store the results in a new dataframe and name it df2_firstname.
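A sketch of points 4 and 5 using column expressions. It assumes the columns are called `StockCode`, `Description`, and `UnitPrice`, which the question does not state, so compare with the printed schema. It also assumes one reading of the condition: the stock ID starts with "227" AND (the description contains "ALARM CLOCK" OR the unit price is above 5). If the intended reading is (stock ID and description) OR price, move the parentheses accordingly.

```python
def alarm_clock_or_pricey(df):
    """Keep rows where StockCode starts with '227' and either the
    Description mentions 'ALARM CLOCK' or UnitPrice exceeds 5."""
    # Cast guards against the stock code having been inferred as a number.
    stock_match = df["StockCode"].cast("string").startswith("227")
    desc_match = df["Description"].contains("ALARM CLOCK")
    price_match = df["UnitPrice"] > 5
    # & and | build column expressions; parentheses fix the precedence.
    return df.filter(stock_match & (desc_match | price_match))


# Usage in the pyspark shell:
# df2_firstname = alarm_clock_or_pricey(df_)
# df2_firstname.show(df2_firstname.count(), truncate=False)
```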

6- Show the sum of the quantities ordered, the minimum quantity ordered, and the maximum quantity ordered for the transactions you extracted in point 4 above. Investigate the UI, take a screenshot of the DAG plan, and in your analysis report add the number of stages the job required, the total time required per stage, and the number of tasks required for each job. Finally, drill down on each stage, produce the DAG graph for each stage, and analyze the statistics; note the shuffle size and the number of partitions in your report.
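The three aggregates in point 6 can be computed in a single pass. The sketch below assumes the quantity column is named `Quantity`; the helper names and output aliases are hypothetical. A global aggregate like this typically runs as a partial aggregation per partition followed by a shuffle into one partition, which is usually where the extra stage and the shuffle size in the UI come from.

```python
def quantity_summary(df):
    """Return a one-row DataFrame with the sum, min and max of Quantity."""
    return df.selectExpr(
        "sum(Quantity) AS total_quantity",
        "min(Quantity) AS min_quantity",
        "max(Quantity) AS max_quantity",
    )


def partition_count(df):
    """Number of partitions backing the DataFrame."""
    return df.rdd.getNumPartitions()


# Usage in the pyspark shell:
# summary = quantity_summary(df2_firstname)
# summary.explain()                # physical plan, to compare with the DAG in the UI
# summary.show(truncate=False)
# print(partition_count(df2_firstname))
```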

7- Show all the transactions mentioned in point 4 above that originated from outside the United Kingdom.
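Point 7 is one more filter on the dataframe from point 5. The sketch assumes the country column is named `Country` and holds the literal value "United Kingdom"; both should be checked against the data. The helper name `outside_uk` is hypothetical.

```python
def outside_uk(df):
    """Keep rows whose Country differs from 'United Kingdom'.

    Rows with a null Country are dropped, because a comparison
    with null is neither true nor false in Spark.
    """
    return df.filter(df["Country"] != "United Kingdom")


# Usage in the pyspark shell:
# non_uk = outside_uk(df2_firstname)
# non_uk.show(non_uk.count(), truncate=False)
```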
