Question: SPARK, SPARK SQL

Chegg does not let me post the whole CSV files (applicant.csv and record.csv), but I hope someone can still help. Below is a snippet of each CSV file as an example:

- applicant.csv

- record.csv

QUESTION:

Please help with this part (Q1-Q5). I need it as soon as possible! Steps and descriptions with pictures are needed.

If any programming language is needed, it must be Python. Everything else is Spark and "pyspark" console. Thank you!

Submit a single doc/pdf file that has Spark code and an English description of what your code is doing. Also, include screenshots of your code and the output in the file. Perform the commands on the "pyspark" console.

Part I (60 pts)

Find the applicant.csv and record.csv files from the course shell and answer the following questions:

- The applicant.csv file has the personal information of the credit card applicant.
  - ID: Client number
  - GENDER: Gender
  - OWN_CAR: Is there a car
  - OWN_REALTY: Is there a property
  - CHILDREN: Number of children
  - INCOME_TOTAL: Annual income
  - INCOME_TYPE: Income category
  - DAYS_BIRTH: Birthday (count backward from current day (0), -1 means yesterday)
  - DAYS_EMPLOYED: Start date of employment (count backward from current day (0); if positive, it means the person is currently unemployed)
  - OCCUPATION_TYPE: Occupation
  - FAM_MEMBERS: Family size
- The record.csv file has the credit record of the applicant and consists of three features.
  - ID: Client number
  - MONTHS_BALANCE: Record month (the month of the extracted data is the starting point, backward, 0 is the current month, 1 is the previous month, and so on)
  - STATUS: Status
    - 0: 1-29 days past due
    - 1: 30-59 days past due
    - 2: 60-89 days overdue
    - 3: 90-119 days overdue
    - 4: 120-149 days overdue
    - 5: Overdue or bad debts, write-offs for more than 150 days
    - C: paid off that month
    - X: No loan for the month

Q1. How many male and female applicants applied for the credit card? (10 pts)

Q2. Calculate the average annual income amount of the applicants for each of the income types. (10 pts)

Q3. Count the number of credit card applicants based on age group. (10 pts)

Q4. Merge the two data frames using inner join so that all variables (columns) in the applicant frame are added to the record data frame. Name the merged frame master_frame. How many observations (rows) are present in master_frame? Hint: Find an attribute from both data frames that can serve as a unique key. (10 pts)

Q5. Treating the clients whose credit record is more than 90 days overdue as bad debt, find the occupations of the clients who are not in bad debt and not unemployed. (20 pts)
