Question:

Problem 1

Load the included data set iris.data into a PySpark DataFrame.

Load it with a pre-defined schema

Let the column names be sepal_length,sepal_width,petal_length,petal_width,species.

For the numerical columns, use the most efficient data type in the schema definition.

Explain why this is the most efficient data type (in no more than 2 sentences).
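A minimal sketch of one way to approach Problem 1, assuming the file is a headerless comma-separated file at data/iris.data (the path used later in the question). It takes "most efficient" to mean FloatType: a 4-byte single-precision float uses half the storage of the 8-byte DoubleType, and the iris measurements are recorded to one decimal place, so single precision is enough. The helper names build_iris_schema and load_iris are hypothetical, not part of the assignment.

```python
# Column names come from the problem statement; the helper names are assumptions.
NUMERIC_COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
LABEL_COLUMN = "species"


def build_iris_schema():
    """Return a StructType that stores the four measurements as 4-byte floats."""
    # Imported here so the column constants stay usable without PySpark installed.
    from pyspark.sql.types import FloatType, StringType, StructField, StructType

    fields = [StructField(name, FloatType(), True) for name in NUMERIC_COLUMNS]
    fields.append(StructField(LABEL_COLUMN, StringType(), True))
    return StructType(fields)


def load_iris(spark, path="data/iris.data"):
    """Read the headerless CSV file using the pre-defined schema."""
    return spark.read.csv(path, schema=build_iris_schema(), header=False)
```

In a notebook this would typically be called as iris_df = load_iris(SparkSession.builder.getOrCreate()), followed by iris_df.printSchema() to confirm the column types.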

Problem 2

Using the DataFrame loaded in Problem 1, use PySpark DataFrame methods that do the following:

Finds the average value of each column, grouped by species

For each average, round it to 3 decimal places

Label each averaged column avg_{column name} in the output DataFrame.

Order the result by species in ascending order.
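The four requirements above can be sketched as a single groupBy/agg chain. This is an illustrative sketch, not the expected answer: the function names avg_column_name and average_by_species are hypothetical, and iris_df is assumed to be the DataFrame from Problem 1 with the column names given there.

```python
NUMERIC_COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def avg_column_name(column):
    """Build the output label required by the problem: avg_{column name}."""
    return f"avg_{column}"


def average_by_species(iris_df):
    """Average each numeric column per species, rounded to 3 decimal places."""
    # Imported here so the naming helper can be used without PySpark installed.
    from pyspark.sql import functions as F

    averages = [
        F.round(F.avg(column), 3).alias(avg_column_name(column))
        for column in NUMERIC_COLUMNS
    ]
    return (
        iris_df.groupBy("species")
        .agg(*averages)
        .orderBy(F.col("species").asc())  # ascending order by species
    )
```

Calling average_by_species(iris_df).show() would print one row per species with the four avg_ columns.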

Problem 3

Turn the solution into a Python script so that, in the next cell, I can run !python problem4 data/iris.data and the output looks exactly like the output of Problem 2.

Hints:

When reading lines in the mapper, there may be errors. Handle this with try and except clauses, passing on any case that raises an error.

Solve this whole problem in two MRJob steps using MRStep (from mrjob.step).

Step 1 - The goal is to compute the grouped averages for each column: map each species to the list of values in the row being read.

Step 2 - The goal is to sort the output by species, so that the species are in alphabetical order.

For the output of the step 2 mapper, use the same key for all outputs. In the step 2 reducer, sort by the values. Parse the step 2 reducer output so that it yields species_name, avg_column_name1, avg_column_name2, etc.
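The data flow in these hints can be sketched in plain Python, without mrjob, so the logic of each step can be checked on its own. This is a sketch under assumptions: the four generator names and the run_locally driver are hypothetical, and rounding to 3 places is done in the step 1 reducer. In an MRJob subclass, the generators would become the mapper and reducer of the two MRStep entries returned from steps().

```python
from collections import defaultdict

NUMERIC_COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def step1_mapper(_, line):
    """Map one CSV line to (species, [four floats]); skip lines that fail to parse."""
    try:
        *values, species = line.strip().split(",")
        numbers = [float(value) for value in values]
        if len(numbers) != len(NUMERIC_COLUMNS) or not species:
            raise ValueError("unexpected number of fields")
        yield species, numbers
    except ValueError:
        pass  # per the hint: pass on any row that raises an error


def step1_reducer(species, rows):
    """Average each column over all rows of one species, rounded to 3 places."""
    rows = list(rows)
    averages = [round(sum(column) / len(rows), 3) for column in zip(*rows)]
    yield species, averages


def step2_mapper(species, averages):
    """Send every record to the same key so a single reducer sees them all."""
    yield None, [species] + averages


def step2_reducer(_, records):
    """Sort by value (species comes first in each record) and unpack the result."""
    for species, *averages in sorted(records):
        yield species, averages


def run_locally(lines):
    """Simulate the shuffle between steps; mrjob would normally do this part."""
    grouped = defaultdict(list)
    for line in lines:
        for species, numbers in step1_mapper(None, line):
            grouped[species].append(numbers)
    records = []
    for species, rows in grouped.items():
        for key, averages in step1_reducer(species, rows):
            records.extend(value for _, value in step2_mapper(key, averages))
    return list(step2_reducer(None, records))
```

Using one shared key in step 2 forces all records through a single reducer call, which is what makes a global sort possible; that is acceptable here because only one record per species remains after step 1.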
