Question: Research Project Second Assignment (35 Points)
In this week's research project assignment, you are required to:
1. Clean the data by checking for outliers and missing data
Data cleaning is the process of inspecting your data for the issues listed below, taking action on any issues identified, and accurately documenting the action taken:
- Unusual entries or outliers
- Missing data
- Incorrect data entries
For more information on data cleaning and exploration, read the article at the following link:
https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/
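The checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `hours_study` values are made up, the helper names `find_missing` and `iqr_outliers` are hypothetical, and the 1.5 x IQR rule is one common convention for flagging outliers, not a method the assignment prescribes.

```python
import statistics

def find_missing(values):
    """Return the positions of entries that are None or blank strings."""
    return [i for i, v in enumerate(values)
            if v is None or (isinstance(v, str) and v.strip() == "")]

def iqr_outliers(values, k=1.5):
    """Return numeric values lying more than k * IQR beyond the quartiles."""
    clean = [v for v in values if isinstance(v, (int, float))]
    q1, _, q3 = statistics.quantiles(clean, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in clean if v < low or v > high]

# Hypothetical column with one missing entry and one suspicious value.
hours_study = [2, 3, 4, 5, None, 3, 4, 40, 5, 2]

missing_positions = find_missing(hours_study)
outliers = iqr_outliers(hours_study)

print("Missing at positions:", missing_positions)
print("Possible outliers:", outliers)
```

A flagged value is only a candidate for correction. Whether you keep, fix, or drop it is a judgment call, and whichever you choose should be written down as part of the documented action.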
2. Explore your project data set and summarize it using descriptive statistics, graphs, etc.
You will need to provide summary statistics for each variable in your data set. If a variable is categorical, state its levels. There are many ways to summarize your data; you are encouraged to be creative, but also accurate, in how you summarize and present it.
In general:
- A categorical variable is summarized using a frequency table and visualized using bar charts and pie charts
- A pair of categorical variables is summarized using a contingency table
- A numeric variable is summarized using descriptive statistics: measures of central tendency (mean, median, and mode), measures of variation or dispersion (range, standard deviation), and measures of position (z-scores, percentiles).
- A histogram, dot plot, or stem-and-leaf plot provides visual information on the distribution of a numeric variable
- Outliers can often be identified using a box plot
- Visual inspection of a histogram can also be used to assess whether a variable is approximately normally distributed
- A pair of numeric variables is summarized using a scatter plot
- A scatter plot is usually a good indicator of whether two variables are correlated
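As a rough illustration of the summaries listed above, the sketch below computes descriptive statistics for a numeric variable, a frequency table for a categorical variable, and a contingency table for a pair of categorical variables. It uses only the Python standard library. The data values and the helper names (`describe`, `frequency_table`, `contingency_table`) are hypothetical, and the `passed` variable is invented purely to have a second categorical variable to cross-tabulate.

```python
import statistics
from collections import Counter

def describe(values):
    """Central tendency, dispersion, and position for a numeric variable."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return {
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "range": max(values) - min(values),
        "stdev": sd,
        "quartiles": statistics.quantiles(values, n=4),
        "z_scores": [(v - mean) / sd for v in values],
    }

def frequency_table(categories):
    """Count how often each level of a categorical variable occurs."""
    return dict(Counter(categories))

def contingency_table(rows, cols):
    """Count how often each pair of levels occurs together."""
    return dict(Counter(zip(rows, cols)))

# Hypothetical data, made up for illustration only.
iq = [95, 100, 100, 105, 110]
sex = ["F", "M", "F", "F", "M"]
passed = ["yes", "no", "yes", "no", "no"]

summary = describe(iq)
sex_counts = frequency_table(sex)
sex_by_passed = contingency_table(sex, passed)

print(summary)
print(sex_counts)
print(sex_by_passed)
```

The keys of `sex_counts` are the levels of the categorical variable, so the same table also answers the requirement to state the levels.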
For this project (and this short assignment), you need:
- One graph for each variable. If you have 1 y-variable and 3 x-variables, you will need 4 graphs. Use either a histogram or boxplot for numeric variables, and either a bar graph or pie chart for categorical variables.
- One graph for each x-y combination. If your y-variable is Test Score and your x-variables are Sex, Hours Study, and IQ, you will need 3 graphs: 1 graph showing Test Score and Sex (side-by-side boxplot), 1 graph showing Test Score and Hours Study (scatterplot), and 1 graph showing Test Score and IQ (scatterplot).
- Descriptive statistics for all numeric variables (in the example above: Test Score, Hours Study, and IQ)
- Frequency table for all categorical variables (in the example above: Sex)
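The graph requirements above amount to a simple counting rule, which can be sketched as a small checklist generator. The function name `plan_graphs` and the `kinds` mapping are hypothetical, and the sketch assumes the y-variable is numeric, as it is in the Test Score example.

```python
def plan_graphs(y, xs, kinds):
    """List the graphs the assignment asks for.

    kinds maps each variable name to "numeric" or "categorical".
    Assumes the y-variable is numeric.
    """
    plan = []
    # One graph per variable.
    for name in [y] + xs:
        if kinds[name] == "numeric":
            plan.append(f"{name}: histogram or boxplot")
        else:
            plan.append(f"{name}: bar graph or pie chart")
    # One graph per x-y combination.
    for x in xs:
        if kinds[x] == "categorical":
            plan.append(f"{y} by {x}: side-by-side boxplot")
        else:
            plan.append(f"{y} vs {x}: scatterplot")
    return plan

# Variable names taken from the example in the assignment text.
kinds = {
    "Test Score": "numeric",
    "Sex": "categorical",
    "Hours Study": "numeric",
    "IQ": "numeric",
}
plan = plan_graphs("Test Score", ["Sex", "Hours Study", "IQ"], kinds)

for item in plan:
    print(item)
```

For the example variables this gives 4 single-variable graphs plus 3 x-y graphs, 7 in total.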
