Question:

Problem 6: Grouped Means
A diamond's cut is a categorical feature describing how well-proportioned the dimensions of the diamond are.
This feature has five possible levels. These levels are, in increasing order of quality, Fair, Good, Very Good,
Premium, and Ideal.
We will now use pair RDD tools to calculate the count, average price, and average carat size for diamonds with
each of the five levels of cut. Note that for any tuple within the diamonds RDD:
- The carat size for the associated diamond is stored at index 0 of the tuple.
- The cut level for the associated diamond is stored at index 1 of the tuple.
- The price for the associated diamond is stored at index 6 of the tuple.
Complete the following steps in a single code cell:
1. Create a list named cut_summary by performing the transformations and action described below. Try
to perform all of the steps with a single (multi-line) statement by chaining together the methods.
   - Transform each observation into a tuple of the form (cut, (carat, price, 1)). Note
     that the first element of this tuple indicates the cut level (which we will be grouping by), while
     the second element of the tuple is another tuple containing the other information in which we
     are interested.
   - Use reduceByKey() to perform an elementwise sum of the tuples (carat, price, 1)
     for each separate value of the key, which is represented by the cut value. This will produce an
     RDD with 5 elements of the form (cut, (sum_of_carat, sum_of_price, count)).
   - Use map() to transform the tuples in the previous RDD into ones with the following form:
     (cut, count, mean_carat_size, mean_price). Round the two means to 2 decimal
     places.
   - Call the collect() method to create the desired 5 element list.
2. To better display the results, use cut_summary to create a Pandas DataFrame named cut_df. Set
the following names for the columns of the DataFrame: Cut, Count, Mean_Carat, Mean_Price.
3. Display cut_df (without using the print() function).

Step by Step Solution
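The steps in the problem can be sketched roughly as follows. This is one possible approach, not a verified reference answer. It assumes that diamonds is the RDD of tuples built in the earlier problems and that the carat and price fields have already been converted to numbers. The helper names add_triples, to_summary_row, summarize_by_cut, and make_cut_df are hypothetical and were introduced only for this illustration.

```python
def add_triples(a, b):
    """Elementwise sum of two (carat, price, count) tuples."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def to_summary_row(pair):
    """Turn (cut, (sum_carat, sum_price, count)) into
    (cut, count, mean_carat, mean_price), rounding both means to 2 places."""
    cut, (sum_carat, sum_price, count) = pair
    return (cut, count, round(sum_carat / count, 2), round(sum_price / count, 2))


def summarize_by_cut(diamonds):
    """Chain the transformations and the collect() action on the diamonds RDD.

    Assumes each element is a tuple with carat at index 0, cut at index 1,
    and price at index 6, with carat and price already numeric.
    """
    return (
        diamonds
        .map(lambda row: (row[1], (row[0], row[6], 1)))  # (cut, (carat, price, 1))
        .reduceByKey(add_triples)                        # (cut, (sums..., count))
        .map(to_summary_row)                             # (cut, count, means...)
        .collect()                                       # list with one row per cut
    )


def make_cut_df(cut_summary):
    """Build the display DataFrame from the collected list of rows."""
    import pandas as pd  # imported here so the helpers above run without pandas

    return pd.DataFrame(
        cut_summary,
        columns=["Cut", "Count", "Mean_Carat", "Mean_Price"],
    )
```

In the notebook cell, one would then write cut_summary = summarize_by_cut(diamonds), followed by cut_df = make_cut_df(cut_summary), and end the cell with a bare cut_df line so that it is displayed without print(). The chain inside summarize_by_cut can also be written inline as the single multi-line statement the problem asks for. Note that reduceByKey() does not guarantee any particular ordering of the five cut levels in the result.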
