Using PySpark SQL Data frames

Question 1:

The reviews column contains information such as the number of stars for each review (the rating). The ratings are stored in an array (`reviews.stars`) for each business location (you should check this for yourself). Return the top five most common rating arrays. For example, an array might look like: [5, 5]

My Partial Answer Below: (Please Help Me Continue It)

from pyspark.sql import functions as F

df_SQL.printSchema()

df_SQL_aggregate = df_SQL.groupBy("id").count()

df_SQL_aggregate.show()

------------------------------------------------------------------

Question 2:

For this question, you will filter out null ratings and then compute the average rating for each business location (using the field: `id`).

a) Create a new dataframe retaining two fields: `id`, `reviews.stars`

b) Create a row for each rating. Hint: use the `withColumn()` and `explode()` functions. You will need to import the `explode()` function by issuing:

`from pyspark.sql.functions import explode`

c) Return a count of the number of ratings in this dataframe

d) Drop rows where the rating is null, and return a count of the number of non-null ratings

e) Compute the average rating, grouped by id. After the average is computed, sort by id in ascending order and show the top 10 records.

Hint: this can all be done in one line using the `agg()` function. This id should be at the top: 000136e65d50c3b7

*Don't worry about the dataset itself; if you can just help me with the overall code, that would help.*
