Problem1
Load the included data set iris.data into a pyspark dataframe;
Load it with a pre-defined schema
Let the column names be sepal_length,sepal_width,petal_length,petal_width,species.
For the numerical columns, use the most efficient data type in the schema definition.
Explain why this is the most efficient data type (in no more than 2 sentences).
Problem2
Using the loaded dataframe from Problem 1, use PySpark DataFrame methods that do the following;
Finds the average value of each column, grouped by species
For each average, round it to 3 decimal places
For each averaged column, label it avg_{column name} in the output dataframe.
Order it by species in ASCENDING order
Problem 3
Make it into a Python script such that, in the next cell, I can run !python problem4 data/iris.data and the output will look exactly like the output of Problem 2.
Hints:
When reading into the mapper, there may be errors. Handle these with try/except clauses, passing on any case that raises an error.
Solve this whole problem in two MRJob steps using the MRStep class.
Step 1 - Goal is to acquire the grouped average calculations for each column; map each species to a list of values associated with the given row being read.
Step 2 - Goal is to sort the output by species, so that they appear in alphabetical order.
For the output of the step 2 mapper, use the same key for all records. In the step 2 reducer, sort by the values. Parse the step 2 reducer output so it yields species_name, avg_column_name1, avg_column_name2, etc.
