You are doing initial exploratory analysis in PySpark and one of the sources you need to...
Question:
You are doing initial exploratory analysis in PySpark and one of the sources you need to include resides in a PostgreSQL database. Using the provided notebook, answer the following questions within Google Colaboratory and submit your answers on SUNLearn and Git.

1. Using a PySpark dataframe, print the schema of customer table in the pagila PostgreSQL database by utilising a JDBC connection. (3)
2. Use the Spark SQL API to query the customer table, compute the number of unique email addresses in that table and print the result in the notebook. (3)
3. Repeat this calculation using only the Dataframe API and print the result. (1)
4. How many partitions are present in the dataframe resulting from Question 6.3 (additionally provide the code necessary to determine that). (1)
5. Compute the min and max of customer.create_date and print the result (once more using the Spark DataFrame API and not the Spark SQL API). (1)
6. Determine which first names occur more than once
   1. using the Spark SQL API (printing the result), and (1)
   2. using the Spark Dataframe API (printing the result once more). (1)
7. Port the PostgreSQL below to the PySpark DataFrame API and execute the query within Spark (not directly on PostgreSQL): (5)

    SELECT
        staff.first_name
        ,staff.last_name
        ,SUM(payment.amount)
    FROM payment
    INNER JOIN staff ON payment.staff_id = staff.staff_id
    WHERE payment.payment_date BETWEEN '2007-01-01' AND '2020-02-01'
    GROUP BY
        staff.last_name
        ,staff.first_name
Expert Answer:
Using a PySpark dataframe, print the schema of the customer table in the pagila PostgreSQL database by utilising a JDBC connection. Python: import the necessary libraries (from pyspark.sql import SparkSession). Cre...
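The answer preview is cut off, so the following is only a minimal sketch of how the seven questions might be approached; it is not the original expert answer. It assumes the notebook already provides a SparkSession named `spark` with the PostgreSQL JDBC driver on its classpath. All helper names (`jdbc_options`, `load_table`, `run`, and so on) are hypothetical, and the host, port, user and password must be replaced with the notebook's own values. Question 4 is read here as asking about the distinct-email DataFrame from question 3, which is an interpretation.

```python
# Sketch only: helper names and connection details are illustrative assumptions.

def jdbc_options(host, port, database, user, password):
    """Build the option dict for Spark's JDBC reader against PostgreSQL."""
    return {
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "driver": "org.postgresql.Driver",
        "user": user,
        "password": password,
    }

def load_table(spark, table, options):
    """Read one PostgreSQL table into a DataFrame over JDBC."""
    return spark.read.format("jdbc").options(dbtable=table, **options).load()

def unique_emails_sql(spark, customer):
    """Q2: distinct e-mail count through the Spark SQL API."""
    customer.createOrReplaceTempView("customer")
    return spark.sql("SELECT COUNT(DISTINCT email) AS n FROM customer").first()["n"]

def unique_emails_df(customer):
    """Q3: same count with DataFrame methods (NULLs dropped, as COUNT(DISTINCT) does)."""
    return customer.select("email").dropna().distinct().count()

def partition_count(df):
    """Q4: number of partitions backing a DataFrame."""
    return df.rdd.getNumPartitions()

def create_date_range(customer):
    """Q5: (min, max) of create_date via DataFrame aggregation."""
    lo = customer.agg({"create_date": "min"}).first()[0]
    hi = customer.agg({"create_date": "max"}).first()[0]
    return lo, hi

def repeated_first_names_sql(spark, customer):
    """Q6.1: first names occurring more than once, Spark SQL API."""
    customer.createOrReplaceTempView("customer")
    return spark.sql(
        "SELECT first_name, COUNT(*) AS n FROM customer "
        "GROUP BY first_name HAVING COUNT(*) > 1"
    )

def repeated_first_names_df(customer):
    """Q6.2: same result with the DataFrame API."""
    counts = customer.groupBy("first_name").count()
    return counts.filter(counts["count"] > 1)

def payments_by_staff(payment, staff):
    """Q7: DataFrame port of the join / filter / group-by query."""
    joined = payment.join(staff, on="staff_id", how="inner")
    in_range = joined.filter(
        joined["payment_date"].between("2007-01-01", "2020-02-01")
    )
    totals = in_range.groupBy("last_name", "first_name").agg({"amount": "sum"})
    return totals.withColumnRenamed("sum(amount)", "total_amount").select(
        "first_name", "last_name", "total_amount"
    )

def run(spark, options):
    """Print every answer in order, given a live SparkSession."""
    customer = load_table(spark, "customer", options)
    customer.printSchema()                                        # Q1
    print(unique_emails_sql(spark, customer))                     # Q2
    print(unique_emails_df(customer))                             # Q3
    print(partition_count(customer.select("email").distinct()))   # Q4
    print(create_date_range(customer))                            # Q5
    repeated_first_names_sql(spark, customer).show()              # Q6.1
    repeated_first_names_df(customer).show()                      # Q6.2
    payments_by_staff(
        load_table(spark, "payment", options),
        load_table(spark, "staff", options),
    ).show()                                                      # Q7
```

In the notebook this would be called as `run(spark, jdbc_options(host, port, "pagila", user, password))` with real credentials. The helpers take the session and DataFrames as arguments, so each one can be run and checked on its own; the aggregations run inside Spark on the loaded DataFrames rather than being sent to PostgreSQL as SQL text.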