Question: Using PySpark SQL DataFrames

Question 1:

The reviews column contains information such as the number of stars for each review (the rating). The ratings are stored in an array (reviews.stars) for each business location (you should check for yourself). Return the top five most common rating arrays. For example, an array might look like: [5, 5]

My Partial Answer Below: (Please Help Me Continue It)

from pyspark.sql import functions as F

df_SQL.printSchema()

df_SQL_aggregate = df_SQL.groupBy("id").count()
df_SQL_aggregate.show()

------------------------------------------------------------------

Question 2:

For this question, you will filter out null ratings and then compute the average rating for each business location (using the field: `id`).

a) Create a new dataframe retaining two fields: `id`, `reviews.stars`

b) Create a row for each rating. Hint: use the `withColumn()` and `explode()` functions. You will need to import the `explode()` function by issuing:

`from pyspark.sql.functions import explode`

c) Return a count of the number of ratings in this dataframe

d) Drop rows where the rating is null, and return a count of the number of non-null ratings

e) Compute the average rating, grouped by id. After the average is computed, sort by id in ascending order and show the top 10 records.

Hint: this can all be done in one line using the `agg()` function. This id should be at the top: 000136e65d50c3b7

*Don't worry about the Dataset itself, if you can just help me with the overall code, that would help*
