Question: Check the schemas: gameclicks.printSchema() root |-- timestamp: string (nullable = true) |-- clickid: string (nullable = true) |-- userId: string (nullable = true) |-- userSessionId: string
![Check the schemas: [ ] gameclicks.printSchema ( ) root T- timestamp:](https://dsd5zvtm8ll6.cloudfront.net/si.experts.images/questions/2024/09/66f392ee2de35_89366f392eda9794.jpg)



Check the schemas:

    gameclicks.printSchema()

    root
     |-- timestamp: string (nullable = true)
     |-- clickid: string (nullable = true)
     |-- userId: string (nullable = true)
     |-- userSessionId: string (nullable = true)
     |-- ishit: string (nullable = true)
     |-- teamId: string (nullable = true)
     |-- teamLevel: string (nullable = true)

    adclicks.printSchema()

    root
     |-- timestamp: string (nullable = true)
     |-- txId: string (nullable = true)
     |-- userSessionId: string (nullable = true)
     |-- teamId: string (nullable = true)
     |-- userId: string (nullable = true)
     |-- adid: string (nullable = true)
     |-- adCategory: string (nullable = true)

Question 1: How many users in each team?

Keywords: DataFrame API, SQL, group by, sort

Use the DataFrame API to group the users by teamId and count how many distinct users are in each team. Sort the result in descending order.

    team_counts = # your code goes here (Q1a: 4 points)
    team_counts.show()

Now rewrite the above query using pure SQL:

    gameclicks.registerTempTable("gameclicks")
    query = # your code goes here (Q1b: 2 points)
    team_counts = spark.sql(query)
    team_counts.show()

Question 2: Now use the ad-clicks dataset to find the number of ad clicks in each hour.

Keywords: group by, parse timestamp, plot

    timestamp_only = adclicks.selectExpr(["to_timestamp(timestamp) as timestamp"])
    click_count_by_hour = # your code goes here (Q2a: 4 points)
    click_count_by_hour.show(24)
