Data cleaning is a very important step in Data Science to get meaningful analytic results or...
Question:
Data cleaning is a very important step in Data Science to get meaningful analytic results or beneficial prediction outcomes. This assignment aims to help students develop the knowledge and analytic skills needed to properly clean unprocessed data for data analysis and model training.

The given dataset is a collection of airline data containing 8 attributes and 6150 records. A data dictionary describing each attribute is given in the table below. You are expected to complete the steps in the following section to clean the dataset and to answer the questions in Parts A and B.

Attribute: Description
Airline ID: Unique OpenFlights identifier for this airline.
Name: Name of the airline.
Alias: Alias of the airline.
IATA: 2-letter IATA code, if available.
ICAO: 3-letter ICAO code, if available.
Callsign: Airline callsign.
Country: The airline's incorporated country or territory.
Active: "Y" if the airline is or has until recently been operational, "N" if it is defunct.

The dataset (airlines_2022.csv) is dirty. Assume that you have received instructions from a senior data scientist to develop an appropriate solution to clean the dataset. The following steps are proposed to clean the dataset after loading it into the selected tool (e.g. Jupyter Notebook), in order to meet the requirements of the senior data scientist:

Step 1. Check the original dataset for any duplicate tuples based on the value of Airline ID.
o Remove any duplicate tuples in the dataset.

Step 2. The Airline ID should range from 0 to any positive integer.
o If the original Airline ID is negative, set it to zero.
o Add a new column to store the cleaned data. Do not overwrite the original Airline ID.

Step 3. A valid airline Name should start with either an English letter or a digit.
o If the original airline Name starts with a character other than an English letter or a digit, replace it with "unknown".
o Add a new column to store the cleaned data. Do not overwrite the original airline Name.

Step 4. A valid Alias should start with an English letter or a digit. Only six symbols are acceptable between letters and digits: "-", "&", ".", [space], "{", and ")".
o If the Alias value is "\N", "\n" or missing, replace it with "unknown".
o If other symbols appear in the Alias value, replace it with "unknown".
o Add a new column to store the cleaned data. Do not overwrite the original Alias value.

Step 5. A valid IATA is composed of two English letters, two digits, or a combination of one English letter and one digit.
o If the IATA value is not valid or missing, replace it with "unknown".
o Add a new column to store the cleaned data. Do not overwrite the original IATA value.

Step 6. A valid ICAO should contain three characters only. The characters can be English letters or digits.
o If the ICAO value is "\N", "\n" or missing, replace it with "unknown".
o Add a new column to store the cleaned data. Do not overwrite the original ICAO value.

Step 7. A valid CallSign should contain three characters only. The characters can be English letters or digits.
o If the CallSign value is "\N", "\n" or missing, replace it with "unknown".
o Add a new column to store the cleaned data. Do not overwrite the original CallSign value.

Questions

Part A
Answer the following questions based on the steps previously described:
1. How many unique tuples are there in the dataset after Step 1?
2. How many unique values are there in the Airline ID attribute after Step 2?
3. How many unique values are there in the Name attribute after Step 3?
4. How many unique values are there in the Alias attribute after Step 4?
5. How many unique values are there in the IATA attribute after Step 5?
6. How many unique values are there in the ICAO attribute after Step 6?
7. How many unique values are there in the CallSign attribute after Step 7?
8. How many unique values are there in the Country attribute after cleaning?
9. How many unique values are there in the Active attribute after cleaning?
10. How many "unknown" values are included in the Name attribute after cleaning?
11. How many "unknown" values are included in the Alias attribute after cleaning?
12. How many "unknown" values are included in the IATA attribute after cleaning?
13. How many "unknown" values are included in the ICAO attribute after cleaning?
14. How many "unknown" values are included in the CallSign attribute after cleaning?
15. How many "unknown" values are included in the Country attribute after cleaning?
16. How many "unknown" values are included in the Active attribute after cleaning?
17. How many tuples are pending removal based on the Name attribute after cleaning?
18. How many tuples are pending removal based on the Alias attribute after cleaning?
19. How many tuples are pending removal based on the IATA attribute after cleaning?
20. How many tuples are pending removal based on the ICAO attribute after cleaning?
21. How many tuples are pending removal based on the CallSign attribute after cleaning?
22. How many tuples are pending removal based on the Country attribute after cleaning?
23. How many unique tuples are included in the cleaned dataset?
24. What percentage of the tuples is removed from the dataset after cleaning?
25. Which attribute causes the most tuples to be removed from the dataset?
26. After the data cleaning process, which attribute(s) would you go back to your client about, to ask for more detail or to discuss the solutions?

Part B
According to the cleaned data, which country owns the most active operation routes? How many active operation routes are currently owned by that country?
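The validity rules in Steps 2 to 7 can be sketched with pandas and regular expressions. This is a minimal sketch, not the assignment's reference solution, and it rests on several assumptions: the sample rows are made up, the "... Clean" column names are hypothetical, values are stripped of surrounding whitespace before checking, and a non-missing ICAO or CallSign that fails the three-character rule is also mapped to "unknown" (Steps 6 and 7 only mention missing values explicitly). The Alias symbol set is taken literally from Step 4, including "{"; swap in "(" if the assignment intended a pair of parentheses.

```python
import re

import pandas as pd

UNKNOWN = "unknown"
NULL_MARKERS = {"\\N", "\\n"}  # the literal two-character strings \N and \n

# Patterns read literally from Steps 3-7 (matched against the whole value).
NAME_PATTERN = r"[A-Za-z0-9].*"                    # Step 3: only the first character is checked
ALIAS_PATTERN = r"[A-Za-z0-9][A-Za-z0-9&. {)\-]*"  # Step 4: six allowed symbols, taken literally
IATA_PATTERN = r"[A-Za-z0-9]{2}"                   # Step 5: exactly two letters/digits
THREE_CHAR_PATTERN = r"[A-Za-z0-9]{3}"             # Steps 6-7: exactly three letters/digits


def is_missing(value) -> bool:
    """True for NaN/None, empty strings, and the literal markers \\N and \\n."""
    if pd.isna(value):
        return True
    text = str(value).strip()
    return text == "" or text in NULL_MARKERS


def clean_text(value, pattern: str) -> str:
    """Return the stripped value if it fully matches the pattern, else 'unknown'."""
    if is_missing(value):
        return UNKNOWN
    text = str(value).strip()  # assumption: surrounding whitespace is not significant
    return text if re.fullmatch(pattern, text) else UNKNOWN


def clean_airlines(df: pd.DataFrame) -> pd.DataFrame:
    """Add cleaned columns next to the originals; originals are never overwritten."""
    out = df.copy()
    # Step 2: negative IDs become zero.
    out["Airline ID Clean"] = out["Airline ID"].clip(lower=0)
    rules = {
        "Name": NAME_PATTERN,
        "Alias": ALIAS_PATTERN,
        "IATA": IATA_PATTERN,
        "ICAO": THREE_CHAR_PATTERN,      # assumption: invalid values -> "unknown"
        "Callsign": THREE_CHAR_PATTERN,  # assumption: invalid values -> "unknown"
    }
    for column, pattern in rules.items():
        out[f"{column} Clean"] = out[column].map(lambda v, p=pattern: clean_text(v, p))
    return out


# Hypothetical sample rows, not taken from airlines_2022.csv.
sample = pd.DataFrame({
    "Airline ID": [-1, 1, 2, 3],
    "Name": ["Unknown", "1Time Airline", "Ærø Air", "135 Airways"],
    "Alias": ["\\N", "ANA & Co.", "Bad#Alias", None],
    "IATA": ["-", "1T", "2A", "\\N"],
    "ICAO": ["N/A", "RNX", "\\N", "GNL"],
    "Callsign": ["\\N", "NEXTIME", "ABC", None],
})

cleaned = clean_airlines(sample)
print(cleaned.filter(like="Clean"))
```

Note that a Name of "Unknown" (capital U) passes Step 3 unchanged, so counting "unknown" values later is case-sensitive unless the comparison is normalised. Under the literal three-character rule, a callsign such as "NEXTIME" is rejected, which is worth raising with the client when answering question 26.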
Expert Answer:
1. To answer the questions, we need to clean the dataset following the steps described.
Step 1:
import pandas as pd
df = pd.read_csv("airlines_2022.csv")
# Remove duplicate tuples based on the value of Airl...
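Since the answer is cut off, here is a minimal sketch of how Step 1 and the counting questions might be approached. It is not the original expert answer. The CSV text is a made-up stand-in for airlines_2022.csv so the block runs on its own, the airline names are hypothetical, and Part B is interpreted as counting active airlines ("Y") per country, which is an assumption about what "active operation routes" means.

```python
import io

import pandas as pd

# Hypothetical stand-in for airlines_2022.csv (same header, invented rows).
csv_text = """Airline ID,Name,Alias,IATA,ICAO,Callsign,Country,Active
1,Alpha Air,\\N,AA,ALP,ALPHA,Canada,Y
1,Alpha Air,\\N,AA,ALP,ALPHA,Canada,Y
2,Beta Jet,\\N,B2,BTJ,BETA,Canada,Y
3,Gamma Wings,\\N,,GMW,GAMMA,Brazil,N
4,Sigma Hop,\\N,S4,SHP,HOP,Brazil,Y
"""

df = pd.read_csv(io.StringIO(csv_text))

# Step 1: keep the first row for each Airline ID and drop the rest.
rows_before = len(df)
deduped = df.drop_duplicates(subset="Airline ID", keep="first").reset_index(drop=True)
rows_after = len(deduped)

# Question 24 style: percentage of tuples removed.
removed_pct = 100 * (rows_before - rows_after) / rows_before


def summarize(column: pd.Series) -> dict:
    """Counts used throughout Part A: distinct values and 'unknown' entries."""
    return {
        "unique": int(column.nunique(dropna=False)),
        "unknown": int((column == "unknown").sum()),
    }


active_summary = summarize(deduped["Active"])

# Part B (assumed reading): active airlines per country.
active_per_country = deduped.loc[deduped["Active"] == "Y", "Country"].value_counts()
top_country = active_per_country.idxmax()
top_count = int(active_per_country.max())

print(rows_before, rows_after, removed_pct)
print(active_summary)
print(top_country, top_count)
```

When reading the real file, note that `pd.read_csv` treats strings such as "NA" and "N/A" as missing by default, so a legitimate two-letter code spelled "NA" would be read as NaN unless `keep_default_na=False` is passed.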