Question: Using Spark

1. Read the dataset using sqlContext

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
spark_df = sqlContext.sql("Select * from Washington_State_HDMA_2016_csv")

2. Compute how many floating and string variables this dataset has

num_float = # (help here)
num_string = # (help here)
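One possible sketch for step 2 counts column types from `spark_df.dtypes`, which is a list of (column name, type string) pairs. The helper name `count_column_types` is made up for illustration, and treating both `float` and `double` as "floating" is an assumption; `decimal(...)` columns are not counted here.

```python
def count_column_types(dtypes):
    """Count floating-point and string columns.

    `dtypes` is a list of (column_name, type_string) pairs, the shape
    returned by a Spark DataFrame's `dtypes` attribute.
    """
    # Assumption: "floating" means Spark's float and double types.
    float_types = {"float", "double"}
    num_float = sum(1 for _, type_name in dtypes if type_name in float_types)
    num_string = sum(1 for _, type_name in dtypes if type_name == "string")
    return num_float, num_string
```

With the DataFrame from step 1, `num_float, num_string = count_column_types(spark_df.dtypes)` would give both counts.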

3. Create a new column named denied (help)

Assume that if denial_reason_name_1 column is not null, then the loan application is rejected/denied

Create a new column in the dataset - Name the column as denied

Encode the denied column as 0 if denial_reason_name_1 is null, otherwise encode the denied column as 1

4. Find the percentage of denied loans (help)

Use the new variable named denied in this analysis

What percentage of loans are denied?

Google the average loan application denial rate in the country. Is this number similar to the US average?

5. Compare the income of approved applicants vs rejected applicants (help)

Use applicant_income_000s variable

Calculate the average income for denied = 1 and denied = 0 applicants (you can use groupBy())

What do you think (e.g., do approved applicants make more money)?

If not, this is against our intuition. Why do you think denied applicants make more money?

6. Relationship between sex and application status (help)

Investigate if female applicants have higher rejection rate as compared to male applicants

Find the rejection rate for males and females.

For simplicity, define the rejection rate as the number of denied applicants (denied = 1) / the number of approved applicants (denied = 0)

Use applicant_sex_name for determining the sex of the applicant

Any comments?

7. Relationship between race and application status (help)

Investigate the relationship between the applicant's race and the loan status.

You can use the denied column you have created and applicant_race_name_1 column

For each race, find the ratio of denied loans

Consider the ratio of denied loans as the number of denied applicants (denied = 1) / the number of approved applicants (denied = 0)

What are your comments? Which race has the highest denied ratio?

8. Check loan_to_income_ratio (help)

Let's do a deeper analysis

Let's create a new variable by dividing loan_amount_000s by applicant_income_000s

Name this variable loan_to_income_ratio

Let's check if the denied loans are the ones with high loan_to_income_ratio.

What are your thoughts?

Hint: logically, we expect denied loans to have a higher loan_to_income_ratio. Is this the case? Include the race variable in the analysis. What do you think about the relationship among the applicant_race_name_1, loan_to_income_ratio, and denied variables?

9. What is the most common denial reason (help)

Use the denial_reason_name_1 variable

Google the most common mortgage denial reasons. Did you get similar results?

10. Give at least 3 more insights (help)

Give us more insights. Use your intuition and do some more analysis to give us more insights about the dataset.

Feel free to experiment

You can use python visualization tools
