Problem1
Load the included data set iris.data into a pyspark dataframe;
Load it with a pre-defined schema
Let the column names be sepal_length,sepal_width,petal_length,petal_width,species.
For the numerical columns, use the most efficient data type in the schema definition.
Explain why this is the most efficient data type (in no more than 2 sentences).
Problem2
Using the loaded dataframe from Problem 1, use PySpark DataFrame methods that do the following;
Finds the average value of each column, grouped by species
For each average, round it to 3 decimal places
For each averaged column, label it avg_{column name} in the output dataframe.
Order it by species in ASCENDING order
Problem 3
Make it into a Python script such that, in the next cell, I can run !python problem4 data/iris.data and the output will look exactly like the output of Problem 2.
Hints:
When reading into the mapper, there may be errors. Handle these with try/except clauses, passing on any case that raises an error.
Solve this whole problem in two MRJob steps using the MRStep class.
Step 1 - Goal is to acquire the grouped average calculations for each column; map each species to a list of values associated with the given row being read.
Step 2 - Goal is to sort the output by species, so that they appear in alphabetical order.
For the output of the step 2 mapper, use the same key for all records. In the step 2 reducer, sort by the values. Parse the step 2 reducer output so it yields species_name, avg_column_name1, avg_column_name2, etc.
