Determine how many users have received more than 5000 cool compliments. Create a variable user_count...
Question:
Question 1

Determine how many users have received more than 5000 "cool" compliments.

• Create a variable user_count (an integer) which contains the number of users with more than 5000 "cool" compliments (using the compliment_ field).

In [ ]: # YOUR CODE HERE
        raise NotImplementedError()

In [ ]: assert type(user_count) == int, \
            "The user_count variable should be an integer."

In [ ]: # Autograder cell. This cell is worth 2 points (out of 20). This cell contains hidden tests.

Question 2 -- Useful Positive Reviews

Determine the top 5 most useful positive reviews.

• Create a variable top_5_useful_positive. This should be a PySpark DataFrame.
• For this question a "positive review" is one with 4 or 5 stars.
• The DataFrame should be ordered by useful and contain 5 rows.
• The DataFrame should have these columns (in this order):
  ■ review_id
  ■ useful
  ■ stars

In [ ]: # YOUR CODE HERE
        raise NotImplementedError()

In [ ]: import pyspark

        assert type(top_5_useful_positive) == pyspark.sql.dataframe.DataFrame, \
            "The top_5_useful_positive variable should be a Spark DataFrame."

        assert top_5_useful_positive.columns == ['review_id', 'useful', 'stars'], \
            "The columns are not in the correct order."

        submitted = AutograderHelper.parse_spark_dataframe(top_5_useful_positive)

        assert len(submitted) == 5, \
            "The result must have 5 rows."

In [ ]: # Autograder cell. This cell is worth 1 point (out of 20). This cell does not contain hidden tests.
        # This cell deliberately includes answers to provide guidance on how this question is graded.

        top_useful_review_id = "11GX1yq4MALOMx17vpBcOQ"

        assert submitted["review_id"][0] == top_useful_review_id, \
            f'The first row should have review_id "{top_useful_review_id}" (this review has the most "useful" ratings).'

In [ ]: # Autograder cell. This cell is worth 4 points (out of 20). This cell contains hidden tests.

Question 3 -- Checkins

Determine what hours of the day most checkins occur.

• Create a variable hours_by_checkin_count.
  This should be a PySpark DataFrame.
• The DataFrame should be ordered by count and contain 24 rows.
• The DataFrame should have these columns (in this order):
  ■ hour (the hour of the day as an integer, the hour after midnight being 0)
  ■ count (the number of checkins that occurred in that hour)

Note that the date column in the checkin data is a string with multiple date times in it. You'll need to split that string before parsing.

In [ ]: # YOUR CODE HERE
        raise NotImplementedError()

In [ ]: assert type(hours_by_checkin_count) == pyspark.sql.dataframe.DataFrame, \
            "The hours_by_checkin_count variable should be a Spark DataFrame."

        assert hours_by_checkin_count.columns == ["hour", "count"], \
            "The columns are not in the correct order."

        submitted = AutograderHelper.parse_spark_dataframe(hours_by_checkin_count)

In [ ]: # Autograder cell. This cell is worth 1 point (out of 20). This cell does not contain hidden tests.

        assert len(submitted) == 24, \
            "The hours_by_checkin_count DataFrame must have 24 rows."

        assert submitted["hour"][0] == 1, \
            "The first row should have hour 1."

In [ ]: # Autograder cell. This cell is worth 4 points (out of 20). This cell contains hidden tests.

Question 4 -- Common Words in Useful Reviews

Write a function that takes a Spark DataFrame as a parameter and returns a Spark DataFrame of the 50 most common words from useful reviews and their counts.

• A "useful review" has 10 or more "useful" ratings.
• Convert the text to lower case.
• Use the provided splitter() function in a UDF to split the text into individual words.
• Exclude the words in the provided STOP_WORDS set.
• The returned DataFrame should have these columns (in this order):
  ■ word
  ■ count
• The returned DataFrame should be sorted by count in descending order.
In [ ]: import re

        def splitter(text):
            WORD_RE = re.compile(r"[\w']+")
            return WORD_RE.findall(text)

        STOP_WORDS = {
            "a", "about", "above", "after", "again", "against", "aint", "all",
            "also", "although", "am", "an", "and", "any", "are", "as", "at",
            "be", "because", "been", "before", "being", "below", "between",
            "both", "but", "by", "can", "check", "checked", "could", "did",
            "do", "does", "doing", "don", "down", "during", "each", "few",
            "for", "from", "further", "get", "go", "got", "had", "has",
            "have", "having", "he", "her", "here", "hers", "herself", "him",
            "himself", "his", "how", "however", "i", "i'd", "if", "i'm",
            "in", "into", "is", "it", "its", "it's", "itself", "i've",
            "just", "me", "more", "most", "my", "myself", "no", "nor",
            "not", "now", "of", "off", "on", "once", "one", "online",
            "only", "or", "other", "our", "ours", "ourselves", "out",
            "over", "own", "paid", "place", "s", "said", "same", "service",
            "she", "should", "so", "some", "such", "t", "than", "that",
            "the", "their", "theirs", "them", "themselves", "then", "there",
            "these", "they", "this", "those", "through", "to", "too",
            "under", "until", "up", "us", "very", "was", "we", "went",
            "were", "we've", "what", "when", "where", "which", "while",
            "who", "whom", "why", "will", "with", "would", "you", "your",
            "yours", "yourself", "yourselves",
        }

        def common_useful_words(reviews, limit=50):
            # YOUR CODE HERE
            raise NotImplementedError()
            return most_common

Now we'll run it on the review DataFrame:

In [ ]: common_useful_words_counts = common_useful_words(review)

In [ ]: assert type(common_useful_words_counts) == pyspark.sql.dataframe.DataFrame, \
            "The common_useful_words_counts variable should be a Spark DataFrame."

        assert common_useful_words_counts.columns == ["word", "count"], \
            "The columns are not in the correct order."

        submitted = AutograderHelper.parse_spark_dataframe(common_useful_words_counts)

In [ ]: # Autograder cell. This cell is worth 2 points (out of 20). This cell does not contain hidden tests.
        assert len(submitted) == 50, \
            "The common_useful_words_counts DataFrame must have 50 rows."

        assert submitted["word"][0] == "like", \
            'The first row should have word "like".'

        assert submitted["count"][0] == 101251, \
            "The first row should have count 101251."

In [ ]: # Autograder cell. This cell is worth 6 points (out of 20). This cell contains hidden tests.