Question: In this lab you will use various SQL statements to perform data cleansing and data analysis. The following tables from a sales data warehouse will
In this lab you will use various SQL statements to perform data cleansing and data analysis. The following tables from a sales data warehouse will be used for this lab.
Fact Table
sales
| Column | Type | Key |
| transaction_id | int | Primary key |
| sales_id | int | Foreign key |
| product_id | int | Foreign key |
| region_id | int | Foreign key |
| units_sold | int | Foreign key |
| customer_id | in | Foreign key |
Dimension Tables
products
| Column | Type | Key |
| product_id | int | Primary key |
| name | varchar |
|
| retail_price | decimal(12,2) |
|
categories
| Column | Type | Key |
| category_id | int | Primary key |
| name | varchar |
|
product_category
| Column | Type | Key |
| product_id | int | Primary key |
| category_id | int |
|
customers
| Column | Type | Key |
| customer_id | int | Primary key |
| name | varchar |
|
| state | varchar |
|
| addr1 | varchar |
|
| addr2 | varchar |
|
| zip | varchar |
|
| city | varchar |
|
regions
| Column | Type | Key |
| region_id | int | Primary key |
| name | varchar |
|
sales_people
| Column | Type | Key |
| sales_id | int | Primary key |
| fname | varchar |
|
| lname | varchar |
|
| yrs_of_service | int |
|
| date_of_hire | date |
|
Step 1 - Cleanse Data Set
Examine the data for the following inconsistent or missing data characteristics. Develop a policy for dealing with the inconsistent or missing data. That policy could be ignoring the inconsistency, delete the inconsistent data, or substituting a value, potentially a correct value, for the missing data. Use SQL where necessary.
Identify missing values - FKs that are missing corresponding PKs
Recommendation: write an outer join between the tables. Identify any missing keys and develop a policy for dealing with them.
Note that depending on where this missing FK-PK value is, and inner join may solve the problem since the missing FK would not have a corresponding value in the joined table, and would therefore not be returned in the query, i.e., it would be ignored.
Identify empty columns - columns which have no values
Examine the data in the tables and identify any empty character string columns.
Identify and eliminate duplicate values in key columns
Identify any duplicate values where there should not be duplicates (primary keys)
Identify incorrect column values (numbers as text fields need to be converted to number but cannot)
Numeric data in a range
Identify any numerical values which are outside of a valid range.
Column value in a set
Identify any column values which should be in a particular set but are not in that set. For example, zip codes which are not valid zip codes, state abbreviations which are not valid state abbreviations.
Write the appropriate SQL statements to eliminate any problems you have found.
Be sure to save all SQL statements written and used. You will need to submit those as part of your submission for this assignment.
Step 2 - Create Denormalized DW Fact Table Using SQL
Write a multi-table join between the fact table and the dimension tables. Either create a new table from this join, or create an SQL view(the recommended approach). You will then use this table or view when you perform your analysis in the next step.
Be sure to save all SQL statements written and used. You will need to submit those as part of your submission for this assignment.
Step 3 - Write ETL Report
Write a report which explains what you did to cleanse the data in Step 1. For each data problem encountered in Step 1, identify the following:
What type(s) of data problems you found.
What you did to address the data problem (update the data, delete the data, ignore the problem). This might include SQL statements you ran to correct the problem.
Why you addressed the data problem they way you did.
Step by Step Solution
There are 3 Steps involved in it
Get step-by-step solutions from verified subject matter experts
