Question: This project will be used to integrate concepts developed from all the assignments in the second half of this class. You will identify a data-driven business problem that requires preparation of the data. This preparation involves Extracting data from one or more sources and Transforming or cleaning the data before Loading it into a database for analysis. In other words, you will experience, firsthand, the ETL process of data management.
Resource: This article.
In preparation for your project this term, I need you to do some digging to identify sources and ideas for a decent project.
There are a couple of decisions that have to be made, so I am making part of the project a "deliverable" so you can begin mulling it over. Most ETL tasks involve cleaning and integration. For integration, it is vital that you have an attribute that is common across the tables (may be in pairs).
Model/Approach
First determine whether you are going with Model X or Model Y (see the data sources tab). Both cases may involve some preprocessing. If what you are doing doesn't fit in either, post in the project topic discussion so I can assist and we can all think it through.
Cleaning
Cleaning is one of the most important steps as it ensures the quality of the data in the data warehouse. Cleaning should perform basic data unification rules, such as:
Making identifiers unique (sex categories Male/Female/Unknown, M/F/null, Man/Woman/Not Available are translated to standard Male/Female/Unknown)
Convert null values into a standardized Not Available/Not Provided value
Convert phone numbers and ZIP codes to a standardized form
Validate address fields and convert them into proper naming, e.g. Street/St/St./Str/Str.
Validate address fields against each other (State/Country, City/State, City/ZIP code, City/Street)
All done in SSMS or SSIS.
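The unification rules above could be sketched in Python as follows. This is only an illustration of the logic, not the SSIS or SSMS implementation the project asks for. The function names, the mapping table, and the choice of US-style phone and ZIP formats are all assumptions made for the example.

```python
import re

# Hypothetical mapping; extend it with whatever variants your sources contain.
SEX_MAP = {
    "male": "Male", "m": "Male", "man": "Male",
    "female": "Female", "f": "Female", "woman": "Female",
}

def clean_sex(value):
    """Translate source variants to the standard Male/Female/Unknown."""
    if value is None:
        return "Unknown"
    return SEX_MAP.get(str(value).strip().lower(), "Unknown")

def clean_null(value, placeholder="Not Available"):
    """Replace None or blank strings with one standard placeholder."""
    if value is None or str(value).strip() == "":
        return placeholder
    return value

def clean_phone(value):
    """Format a 10-digit US number as NNN-NNN-NNNN (assumed target format)."""
    digits = re.sub(r"\D", "", str(value or ""))
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the country code
    if len(digits) != 10:
        return "Not Available"
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

def clean_zip(value):
    """Return a 5-digit ZIP, restoring leading zeros lost in numeric columns."""
    digits = re.sub(r"\D", "", str(value or ""))
    if 1 <= len(digits) <= 5:
        return digits.zfill(5)
    if len(digits) == 9:  # ZIP+4: keep the first five digits
        return digits[:5]
    return "Not Available"

def clean_street(value):
    """Expand a trailing St/St./Str/Str. to Street (trailing suffix only)."""
    return re.sub(r"\b(St|Str)\b\.?$", "Street", value.strip(), flags=re.IGNORECASE)
```

For example, `clean_phone("(555) 123-4567")` yields `555-123-4567` and `clean_zip(2134)` yields `02134`. The street rule deliberately touches only a trailing suffix, so a leading "St." as in a saint's name is left alone.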
Transform
The transform step applies a set of rules to transform the data from the source to the target. This includes:
converting any measured data to the same dimension (i.e., a conformed dimension) using the same units so that they can later be joined,
generating surrogate keys or FKs so that you can join data from several sources,
generating aggregates,
deriving new calculated values,
adding columns to create PKs and/or FKs.
All done in SSMS or SSIS.
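A minimal Python sketch of these transform rules is shown below, purely to make the steps concrete. In the project itself they would be built in SSIS or SQL. The sample rows, column names, and unit table are made-up placeholders.

```python
from collections import defaultdict
from itertools import count

# Made-up rows from two sources that record weight in different units.
rows = [
    {"store": "A", "weight": 2.0, "unit": "kg", "price": 10.0, "qty": 3},
    {"store": "A", "weight": 1000.0, "unit": "g", "price": 4.0, "qty": 5},
    {"store": "B", "weight": 500.0, "unit": "g", "price": 2.5, "qty": 4},
]

# Conversion factors to the single unit used in the target (kilograms).
TO_KG = {"kg": 1.0, "g": 0.001}

def transform(source_rows):
    """Add a surrogate key, conform units, and derive a calculated value."""
    next_key = count(1)  # surrogate key generator: 1, 2, 3, ...
    out = []
    for row in source_rows:
        out.append({
            "row_id": next(next_key),                         # surrogate PK
            "store": row["store"],
            "weight_kg": row["weight"] * TO_KG[row["unit"]],  # same units
            "line_total": row["price"] * row["qty"],          # derived value
        })
    return out

def total_by_store(transformed_rows):
    """Aggregate the derived line totals per store."""
    totals = defaultdict(float)
    for row in transformed_rows:
        totals[row["store"]] += row["line_total"]
    return dict(totals)

clean = transform(rows)
totals = total_by_store(clean)
```

Once every row carries `weight_kg`, rows from both sources can be compared or joined directly, which is the point of conforming the units first.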
Data Integration
It is at this stage that you get the most value for the project. This typically means you are adding some attribute from a related set that adds 'Color' to the data. For example, you might add Census data to labor data or other demographic data. The challenge is to locate data that are relatable.
Project direction: You will need to complete a datamart with significant preprocessing (ETL) activities.
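To illustrate why a common attribute matters, here is a small Python sketch of enriching one data set from another through a shared key, similar in spirit to a lookup that also reports the rows with no match. The county codes, column names, and numbers are made-up placeholders, not real Census or labor figures.

```python
# Made-up placeholder rows; the shared attribute here is a county code.
labor = [
    {"county_code": "001", "unemployment_rate": 4.2},
    {"county_code": "003", "unemployment_rate": 5.1},
    {"county_code": "999", "unemployment_rate": 3.0},  # no census match
]
census = [
    {"county_code": "001", "population": 12000},
    {"county_code": "003", "population": 48000},
]

def lookup_join(left, right, key):
    """Enrich each left row with the matching right row on `key`.

    Returns (joined, unmatched) so rows without a match can be inspected
    instead of silently disappearing.
    """
    lookup = {row[key]: row for row in right}
    joined, unmatched = [], []
    for row in left:
        match = lookup.get(row[key])
        if match is None:
            unmatched.append(row)
        else:
            joined.append({**row, **match})
    return joined, unmatched

joined, unmatched = lookup_join(labor, census, "county_code")
```

Checking the `unmatched` list early is a quick way to find out whether two sources are actually relatable before investing in the rest of the pipeline.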
Requirements:
Problem being solved: What do you propose to learn from this data? List several of these business questions and show how your project solution data set could answer them.
Sources: It must pull data from at least three sources (similar or different) and clean and load the datamart, all in one SSIS project. Or you can do this with some other ETL tool of your choice, like Power BI or Tableau.
Volume: Total result data set must add up to at least k records, but not more than k
Destination: SQL Server tables. You can move all the data to a single CSV file and dump it into SQL Server at the end.
Transformation: It must include TWO new columns: one that is populated by the current date and time, so you know when that data was brought into the final dataset, and a second one to know where the data came from (source file name). This may be done through SSIS or in SQL Server.
Note: Filename capturing works only when the source is a flat file. So if your source is NOT a flat file, you may want to make a CSV file an intermediate destination and then use this file as the source. Hint: Use the derived column transformation to add a column.
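The two audit columns can be pictured with the following Python sketch, which reads a flat file and stamps every row with a load timestamp and the source file name. It mirrors the idea of a derived column but is not SSIS itself. The column names `load_datetime` and `source_file`, the function name, and the demo file are assumptions made for the example.

```python
import csv
import os
import tempfile
from datetime import datetime

def read_with_audit_columns(path, loaded_at=None):
    """Read a CSV file and add load_datetime and source_file to each row."""
    loaded_at = loaded_at or datetime.now()
    stamp = loaded_at.strftime("%Y-%m-%d %H:%M:%S")
    source = os.path.basename(path)  # file name only, without the folder
    with open(path, newline="", encoding="utf-8") as handle:
        return [
            {**row, "load_datetime": stamp, "source_file": source}
            for row in csv.DictReader(handle)
        ]

# Demo with a throwaway flat file and a fixed timestamp for repeatability.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "customers.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        handle.write("id,name\n1,Ada\n2,Grace\n")
    audited = read_with_audit_columns(demo_path, datetime(2024, 1, 15, 9, 30, 0))
```

Capturing one timestamp per load, rather than one per row, keeps all rows from the same run identifiable as a single batch.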
In addition it must include at least of the following transformations: data conversion, derived column, data split, lookup, merge, merge join, multicast, union all, fuzzy lookup, unpivot, row sampling, or any of the transforms not covered here.
Data sources: You are welcome to use datasets from work that have been sufficiently "anonymized". In fact, this itself is a valuable transformation task that you can then use to protect your data and make it available for additional analysis/exploration. There are many public data sets that can be used (see the "data sources" tab).
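One possible way to approach anonymizing a work data set is sketched below in Python: direct identifiers are dropped and the record key is replaced with a keyed hash, so rows can still be joined without exposing the original ID. This is pseudonymization under stated assumptions, and whether it counts as "sufficiently" anonymized depends on your data and your employer's rules. The field names, the key, and the token length are hypothetical.

```python
import hashlib
import hmac

def pseudonymize(value, secret_key):
    """Return a stable token for `value` using HMAC-SHA256 with a secret key."""
    digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # shortened for readability (assumption)

def anonymize_rows(rows, id_field, drop_fields, secret_key):
    """Drop direct identifiers and replace the ID with a pseudonymous token."""
    result = []
    for row in rows:
        kept = {k: v for k, v in row.items() if k not in drop_fields}
        kept[id_field] = pseudonymize(row[id_field], secret_key)
        result.append(kept)
    return result

# Made-up sample rows; keep the real key out of the data set and the project.
KEY = b"replace-with-a-private-key"
employees = [
    {"emp_id": "E100", "name": "Ada Example", "dept": "Sales", "salary": 50000},
    {"emp_id": "E101", "name": "Grace Sample", "dept": "Sales", "salary": 52000},
]
safe = anonymize_rows(employees, "emp_id", {"name"}, KEY)
```

Because the same input and key always produce the same token, the pseudonymized ID still works as a join key across tables that were processed with the same key.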
