Question: Shell Script: Task: Data Cleaning: you can use bash / shell tools and awk script. Note: At last I want a tsv file output which is joined
Shell Script:
Task: Data Cleaning: you can use bash shell tools and awk scripts.
Note: In the end I want a single tsv output file, produced by joining the three tsv files after cleaning them. Please find below the screenshots of the three files and of the expected output tsv file. I have listed what needs to be cleaned as points. Please read the question carefully.
Our World in Data is a truly excellent resource for high-quality data across a wide range of topics. For example, during the acute phase of the Covid pandemic, Our World in Data created and hosted many analyses of Covid disease data.
Relevant to this assignment are three linked datasets from the Gallup world happiness surveys, averaged by country for a range of years, and then aligned with data from national sources related to GDP, homicide rates per population, size of the population, and life expectancy. Linked below are tab-separated (tsv) versions of the files:
gdpvshappiness.tsv
Headers are: Entity; Code; Year; Cantril ladder score; GDP per capita, PPP constant international $; Population historical estimates; Continent
homiciderateunodc.tsv
Headers are: Entity; Code; Year; Homicide rate per population Both sexes All ages
lifesatisfactionvslifeexpectancy.tsv
Headers are: Entity; Code; Year; Life expectancy Sex: all Age: at birth Variant: estimates; Cantril ladder score; Population historical estimates; Continent
Screenshots are provided for the data files below
You will be implementing a data-cleaning stage.
Data Cleaning:
You are to create a top-level Bash script called cantrildatacleaning (note: no suffix!), which will use Bash plus shell tools to clean the data. cantrildatacleaning expects three tsv files corresponding to the three files from Our World in Data. The input files may be given in any order. The output is expected to be tab-separated data directed, as before, to standard output.
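Because the three files may arrive in any order, one possible approach is to classify each argument by its header line before doing anything else. The sketch below assumes the headers contain the fragments "GDP per capita", "Homicide rate" and "Life expectancy", as in the header listing above; the function name `classify_file` and the labels it prints are hypothetical and not part of the assignment.

```shell
#!/usr/bin/env bash
# Sketch only: decide which dataset a file is by looking at its header line.
# The labels gdp / homicide / life are illustrative names chosen here.
classify_file() {
    local header
    # Read the first line; tolerate a file without a trailing newline.
    IFS= read -r header < "$1" || [ -n "$header" ] || return 1
    case "$header" in
        *"GDP per capita"*)  echo gdp ;;
        *"Homicide rate"*)   echo homicide ;;
        *"Life expectancy"*) echo life ;;
        *)                   echo unknown; return 1 ;;
    esac
}
```

Inside cantrildatacleaning, a loop over "$@" could call `classify_file` on each argument and store the file name under the matching label, so that the later steps no longer depend on the argument order.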
The overall program should, for a given datafile:
Based on the header (i.e. the top line), make sure that the file is in tab-separated format.
Also based on the header line, report any lines that do not have the same number of cells. Cells are allowed to be empty. Ignore those rows and print an error message to stdout.
Remove the column with header Continent, which is sparsely populated and is not present in one of the files.
Ignore the rows that do not represent countries (the country code field is empty).
Ignore the rows for years outside those for which we have at least some Cantril data. Cantril data may be absent in certain years within the range of those for which there otherwise is data; those cells should be retained. The range of years are to So basically consider the rows only having years between and both inclusive
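The per-file steps above can be sketched as a single awk pass. This is a minimal sketch, not a reference solution: `clean_file` is a hypothetical name, the year bounds are passed in as arguments rather than hard-coded, and it assumes Code is the second column and Year the third, as in the header listings.

```shell
#!/usr/bin/env bash
# Sketch only: clean one tsv file and write the result to stdout.
# $1 = input file; $2 and $3 = first and last year to keep (supplied by the caller).
clean_file() {
    awk -F'\t' -v lo="$2" -v hi="$3" '
        NR == 1 {
            # A header with fewer than two tab-separated cells is not treated as tsv.
            if (NF < 2) {
                print "Error: " FILENAME " is not tab-separated"
                exit 1
            }
            ncols = NF
            # Remember the position of the Continent column, if there is one.
            for (i = 1; i <= NF; i++) if ($i == "Continent") drop = i
        }
        # Report and skip rows whose cell count differs from the header.
        NF != ncols {
            print "Error: " FILENAME " line " NR " has " NF " cells, expected " ncols
            next
        }
        # Skip non-country rows (empty Code) and years outside the range.
        NR > 1 && ($2 == "" || $3 + 0 < lo + 0 || $3 + 0 > hi + 0) { next }
        {
            # Re-emit the row without the Continent column.
            out = ""; sep = ""
            for (i = 1; i <= NF; i++) {
                if (i == drop) continue
                out = out sep $i
                sep = "\t"
            }
            print out
        }
    ' "$1"
}
```

Empty cells survive because awk counts empty fields when the separator is a single tab, so a row with missing Cantril data still has the same cell count as the header.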
The output sent to stdout should have rows with the data in the following order, tab-separated:
While the contents of the input files may change and the order that they are provided to cantrildatacleaning may vary, you can assume that the order of the columns in the various input files will not change.
Hint: You will notice that the country and year combination of cells is unique within each of the three input files. That means a given country code and year combination will be present only once in each file.
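Since each (Code, Year) pair occurs at most once per file, it can serve as a join key in awk associative arrays. The sketch below joins two already-cleaned files; `join_two` and its argument convention are hypothetical, and it assumes Entity, Code and Year are the first three columns. It prints no header and does not impose the final column order, both of which are left to the caller.

```shell
#!/usr/bin/env bash
# Sketch only: full join of two cleaned tsv files on the (Code, Year) key.
# $1, $2 = cleaned files; $3 = value column to take from $1; $4 = value column from $2.
# Keys missing from one file get an empty cell for that file.
join_two() {
    awk -F'\t' -v OFS='\t' -v ca="$3" -v cb="$4" '
        FNR == 1 { next }                          # skip each header line
        { key = $2 SUBSEP $3 }                     # Code + Year
        !(key in name) { name[key] = $1; order[++n] = key }
        FILENAME == ARGV[1] { a[key] = $ca; next } # value from the first file
        { b[key] = $cb }                           # value from the second file
        END {
            # Emit rows in order of first appearance of each key.
            for (i = 1; i <= n; i++) {
                key = order[i]
                split(key, k, SUBSEP)
                print name[key], k[1], k[2], a[key], b[key]
            }
        }
    ' "$1" "$2"
}
```

The same idea extends to three files by keeping one array per value column; piping the result through `sort` would give a deterministic row order if one is required.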
So in short, the cantrildatacleaning script takes three tab-separated files in any order (the column headers inside them do not change) and needs to produce tsv output with the headers mentioned above.
Screenshots of the three files and a sample expected output file are below:
[Screenshots of the three input tables and the expected output. Legible content: columns Entity, Code, Year; rows for Monaco (MCO), Hong Kong (HKG), Macao (MAC), Japan (JPN) and Australia (AUS); in the table with the header "Homicide rate per population Both sexes All ages", rows for Afghanistan (AFG) and for "Africa UN", which appears without a country code.]
