You are doing initial exploratory analysis in PySpark and one of the sources you need to...
Question:
You are doing initial exploratory analysis in PySpark, and one of the sources you need to include resides in a PostgreSQL database. Using the provided notebook, answer the following questions within Google Colaboratory and submit your answers on SUNLearn and Git.

1. Using a PySpark DataFrame, print the schema of the customer table in the pagila PostgreSQL database by utilising a JDBC connection. (3)
2. Use the Spark SQL API to query the customer table, compute the number of unique email addresses in that table and print the result in the notebook. (3)
3. Repeat this calculation using only the DataFrame API and print the result. (1)
4. How many partitions are present in the DataFrame resulting from Question 6.3? (Additionally provide the code necessary to determine that.) (1)
5. Compute the min and max of customer.create_date and print the result (once more using the Spark DataFrame API and not the Spark SQL API). (1)
6. Determine which first names occur more than once
   1. using the Spark SQL API (printing the result), and (1)
   2. using the Spark DataFrame API (printing the result once more). (1)
7. Port the PostgreSQL query below to the PySpark DataFrame API and execute the query within Spark (not directly on PostgreSQL): (5)

   SELECT
       staff.first_name
       ,staff.last_name
       ,SUM(payment.amount)
   FROM payment
       INNER JOIN staff ON payment.staff_id = staff.staff_id
   WHERE payment.payment_date BETWEEN '2007-01-01' AND '2020-02-01'
   GROUP BY
       staff.last_name
       ,staff.first_name;
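One way the query in item 7 might be ported is sketched below. It is an illustration under assumptions, not the graded solution: `payment` and `staff` are assumed to be PySpark DataFrames already loaded from the pagila tables of the same name, and the function name `payments_per_staff` and the output column name `total_amount` are made up for this example.

```python
# Date bounds taken from the WHERE clause of the SQL in item 7.
START_DATE = "2007-01-01"
END_DATE = "2020-02-01"


def payments_per_staff(payment, staff):
    """Total payment amount per staff member, computed inside Spark.

    Assumes `payment` and `staff` are PySpark DataFrames holding the
    pagila tables of the same name.
    """
    # Imported inside the function so the sketch can be pasted into a
    # notebook cell before PySpark has been installed.
    from pyspark.sql import functions as F

    # Keep only the columns the query needs, so the join cannot produce
    # ambiguous column names.
    payments = payment.select("staff_id", "amount", "payment_date")
    staff_names = staff.select("staff_id", "first_name", "last_name")

    return (
        payments
        .join(staff_names, on="staff_id", how="inner")                # INNER JOIN
        .where(F.col("payment_date").between(START_DATE, END_DATE))   # WHERE ... BETWEEN
        .groupBy("last_name", "first_name")                           # GROUP BY
        .agg(F.sum("amount").alias("total_amount"))                   # SUM(payment.amount)
        .select("first_name", "last_name", "total_amount")            # SELECT order
    )
```

Calling `payments_per_staff(payment, staff).show()` in the notebook would trigger the computation and print the result. The string bounds rely on Spark casting them for comparison with the `payment_date` column; an explicit cast could be used instead if that behaviour is not wanted.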
Expert Answer:
1. Using a PySpark DataFrame, print the schema of the customer table in the pagila PostgreSQL database by utilising a JDBC connection. In Python, import the necessary libraries (from pyspark.sql import SparkSession). Cre...
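A minimal sketch of how the JDBC read for item 1 might look follows. The host, port, user, and password defaults are placeholders rather than values from the assignment, and the helper names (`build_session`, `pagila_jdbc_options`, `read_pagila_table`, `print_customer_schema`) and the `jdbc_jar` path parameter are hypothetical. It assumes the PostgreSQL JDBC driver jar is available to Spark.

```python
def build_session(jdbc_jar):
    """Create a SparkSession with the PostgreSQL JDBC driver on the classpath.

    `jdbc_jar` is the local path to the driver jar (an assumption; the
    provided notebook may set this up differently).
    """
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("pagila-eda")
        .config("spark.jars", jdbc_jar)
        .getOrCreate()
    )


def pagila_jdbc_options(table, host="localhost", port=5432,
                        user="postgres", password="postgres"):
    """Build JDBC reader options for one table of the pagila database.

    The connection defaults are placeholders; replace them with the
    values used in the provided notebook.
    """
    return {
        "url": f"jdbc:postgresql://{host}:{port}/pagila",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }


def read_pagila_table(spark, table, **connection):
    """Load one pagila table into a DataFrame through Spark's JDBC source."""
    options = pagila_jdbc_options(table, **connection)
    return spark.read.format("jdbc").options(**options).load()


def print_customer_schema(spark):
    """Item 1: load the customer table and print its schema."""
    customer = read_pagila_table(spark, "customer")
    customer.printSchema()
    return customer
```

In the notebook, `spark = build_session("<path to the PostgreSQL driver jar>")` followed by `customer = print_customer_schema(spark)` would print the schema. The returned DataFrame can then be registered with `customer.createOrReplaceTempView("customer")` for the Spark SQL questions.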