Question: In this lab you will use various SQL statements to perform data cleansing and data analysis. The following tables from a sales data warehouse will

In this lab you will use various SQL statements to perform data cleansing and data analysis. The following tables from a sales data warehouse will be used for this lab.

Fact Table

sales

Column

Type

Key

transaction_id

int

Primary key

sales_id

int

Foreign key

product_id

int

Foreign key

region_id

int

Foreign key

units_sold

int

Foreign key

customer_id

in

Foreign key

Dimension Tables

products

Column

Type

Key

product_id

int

Primary key

name

varchar

retail_price

decimal(12,2)

categories

Column

Type

Key

category_id

int

Primary key

name

varchar

product_category

Column

Type

Key

product_id

int

Primary key

category_id

int

customers

Column

Type

Key

customer_id

int

Primary key

name

varchar

state

varchar

addr1

varchar

addr2

varchar

zip

varchar

city

varchar

regions

Column

Type

Key

region_id

int

Primary key

name

varchar

sales_people

Column

Type

Key

sales_id

int

Primary key

fname

varchar

lname

varchar

yrs_of_service

int

date_of_hire

date

Step 1 - Cleanse Data Set

Examine the data for the following inconsistent or missing data characteristics. Develop a policy for dealing with the inconsistent or missing data. That policy could be ignoring the inconsistency, delete the inconsistent data, or substituting a value, potentially a correct value, for the missing data. Use SQL where necessary.

Identify missing values - FKs that are missing corresponding PKs

Recommendation: write an outer join between the tables. Identify any missing keys and develop a policy for dealing with them.

Note that depending on where this missing FK-PK value is, and inner join may solve the problem since the missing FK would not have a corresponding value in the joined table, and would therefore not be returned in the query, i.e., it would be ignored.

Identify empty columns - columns which have no values

Examine the data in the tables and identify any empty character string columns.

Identify and eliminate duplicate values in key columns

Identify any duplicate values where there should not be duplicates (primary keys)

Identify incorrect column values (numbers as text fields need to be converted to number but cannot)

Numeric data in a range

Identify any numerical values which are outside of a valid range.

Column value in a set

Identify any column values which should be in a particular set but are not in that set. For example, zip codes which are not valid zip codes, state abbreviations which are not valid state abbreviations.

Write the appropriate SQL statements to eliminate any problems you have found.

Be sure to save all SQL statements written and used. You will need to submit those as part of your submission for this assignment.

Step 2 - Create Denormalized DW Fact Table Using SQL

Write a multi-table join between the fact table and the dimension tables. Either create a new table from this join, or create an SQL view(the recommended approach). You will then use this table or view when you perform your analysis in the next step.

Be sure to save all SQL statements written and used. You will need to submit those as part of your submission for this assignment.

Step 3 - Write ETL Report

Write a report which explains what you did to cleanse the data in Step 1. For each data problem encountered in Step 1, identify the following:

What type(s) of data problems you found.

What you did to address the data problem (update the data, delete the data, ignore the problem). This might include SQL statements you ran to correct the problem.

Why you addressed the data problem they way you did.

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer
Step: 1 Unlock blur-text-image
Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock
Step: 3 Unlock

Students Have Also Explored These Related Databases Questions!