Question: Introduction This assignment is based on the three datasets collected from Kaggle.com. You may refer to this website for basic explanation, cleaning, visualization, and analysis

Introduction This assignment is based on the three datasets collected from Kaggle.com.

You may refer to this website for basic explanation, cleaning, visualization, and analysis of the datasets. The three datasets are - surveySchema.csv, freeFormResponses.csv, and multipleChoiceResponses.csv. Tasks Complete the following

tasks:

1. Calculate the median income of male employees and the median income of female employee in the population. Consider the set of all employees in the datasets as the population.

2. Draw an overlaid graph to show the histograms of the incomes of female and male employees in the population. (You create one histogram for male, and another histogram for female, but the two histograms should be on display in the same graph with different colors.

3. Use: random sampling, empirical distributions, sample comparisons, bootstrap, and hypothesis testing as well as A/B testing - that we discussed in the class - to analyze the income gap between female and male employees.

Select a sample from the population. Make sure your sample include 500 employees selected from the population, and consider how to ensure the sampling strategy is fair since the datasets include an overwhelming number of male employees compared to female employees

Define the test statistic, the null hypothesis and the alternative hypothesis

Draw the income histogram for the sample; calculate the median income of the sample; and draw a red dot and a yellow dot for the female median income and male median income of the population respectively, in the histogram .

Draw the histogram of the test statistic of the sample, and draw a red dot to show the corresponding test statistic of the population (e.g. the difference of the median incomes between female and male employees) in the diagram

. Write a procedure to use bootstrap to produce at least 5000 samples

. Draw the histogram of the test statistic of the bootstrap samples

Define the confidence interval and P-value to validate the hypothesis you defined

. 4. Submit all your Python code; and in writing, explain the data cleaning procedure that explains how you defined the test statistic, the hypotheses, random sampling, bootstrap, confidential intervals, Pvales, as well as interpretation of your results, and all outputs described above.

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer
Step: 1 Unlock blur-text-image
Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock
Step: 3 Unlock

Students Have Also Explored These Related Databases Questions!