Question: I need help completing my Exploratory Data Analysis (EDA) for a project. My goal is to create an EDA section that is strong enough to earn full points in the rubric areas for relevance, analysis, and visualization quality. To do that, I need support planning out what my EDA should include and how to structure it.
I want the EDA to clearly meet these goals:
- Fully explore all important variables in the data
- Use visualizations that are appropriate, clear, and visually appealing
- Explain what each plot means and how it relates to the research question
- Demonstrate analytical reasoning, not just charts with no explanation
- Connect results to the research question logically and clearly
My research question:
Do countries that consume more gluten have higher rates of Celiac disease?
What I need from you:
Please help me outline an EDA plan that includes:
- Which variables from each dataset are most important to analyze
- Recommended visuals (maps, scatterplots, histograms, correlations, etc.)
- What each visualization should reveal or help answer
- How to word the interpretation and insights under each plot
- How to make the EDA complete, deep, polished, and visually strong
The goal is a detailed outline I can follow to fully complete the assignment.

Here is the data provided:
Data overview
Dataset #1
- Dataset Name: Grain Consumption By Country For 2023
- Link to the dataset: Downloaded from: https://www.fao.org/faostat/en/#data/FBS
- Number of observations: 492
- Number of variables: 8
- Important variables:
- Country: The country the data is for
- Food Type: The type of food being measured based on its ingredients (wheat, barley, or rye)
- Food Supply By Weight: The available supply of a food type in the country in 2023 per person (in kg per capita)
- Daily Caloric Food Supply: The average daily supply of a food type in the country in 2023 per person (in kcal per capita per day)
- Shortcomings include missing values for many countries for barley and especially rye, and the fact that not every country is included in the dataset (though most are).
Dataset #2
- Dataset Name: Celiac Rates From Different Studies
- Link to the dataset: Copied from: https://doi.org/10.1016/j.cgh.2017.06.037
- Number of observations: 57
- Number of variables: 13
- Important variables:
- Country: The country a study was conducted in
- Patients With CD: The number and rate of patients found to have celiac disease
- Population: The population the study was done on
- Bias Score: An estimate of how biased a study might be
- Some shortcomings are that these studies tend to be 10 to 20 years old, so they could reflect outdated celiac rates, and that the dataset does not cover many countries, since multiple studies were conducted in the same country.
Dataset #3
- Dataset Name: European Celiac Rates
- Link to the dataset: Copied from: https://pmc.ncbi.nlm.nih.gov/articles/PMC4146836
- Number of observations: 14
- Number of variables: 2
- Important variables:
- Country: The European country the data is for
- Prevalence: The prevalence rate for celiac disease
- It is an extremely small dataset, covering only 14 countries with just two variables, so it is unclear how the values were calculated and what populations they come from.
Combining the datasets: The three datasets can be compared and analyzed to identify patterns in gluten consumption and the prevalence of celiac disease. We may find correlations between the datasets that support our hypothesis that countries which consume more gluten have higher rates of celiac disease. The best way to do so is to use country as a unique identifier and merge the datasets on it, while making sure to handle duplicate country values. The 2nd and 3rd datasets would be merged into the 1st by adding a new column for celiac rate (a concrete sketch of this merge appears in the EDA section below).

```python
# Run this code every time when you're actively developing modules in .py files.
# It's not needed if you aren't making modules.
## this code is necessary for making sure that any modules we load are updated here
## when their source code .py files are modified
%load_ext autoreload
%autoreload 2
```

```python
# Setup code -- this only needs to be run once after cloning the repo!
# This code downloads the data from its source to the `data/00-raw/` directory.
# If the data hasn't updated, you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules')  # this tells python where to look for modules to import
import get_data  # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# (yes, you can use Google Drive share links or whatever)
# format is a list of dictionaries; each dict has keys of
#   'url'      -- where the resource is located
#   'filename' -- the local filename where it will be stored

### Original links where data comes from:
### 1. https://www.fao.org/faostat/en/#data/FBS
### 2. https://doi.org/10.1016/j.cgh.2017.06.037
### 3. https://pmc.ncbi.nlm.nih.gov/articles/PMC4146836
datafiles = [
    {'url': 'https://drive.google.com/uc?export=download&id=1bywjRuqGPZSiKgJNwzOozKZ4SHg0Ahz3',
     'filename': 'raw-grain-by-country.csv'},
    {'url': 'https://drive.google.com/uc?export=download&id=1zB0hszdU-_3PcGn3RjK4Yff2-epsIS3l',
     'filename': 'raw-global-celiac.csv'},
    {'url': 'https://drive.google.com/uc?export=download&id=16LX4-K_zmTv-onRc8tzqHHg1tqnHoeTA',
     'filename': 'raw-euro-celiac.csv'}
]

get_data.get_raw(datafiles, destination_directory='data/00-raw/')
```
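The body of get_data.get_raw() is not shown in this notebook; only the call above is. As a rough illustration of what such a helper might look like, assuming it simply streams each URL to its target filename using the requests and tqdm packages installed above (these internals are a guess, not the author's actual module), consider:

```python
# modules/get_data.py -- hypothetical sketch; the real module may differ
import os
import requests
from tqdm import tqdm

def get_raw(datafiles, destination_directory='data/00-raw/'):
    """Download each {'url': ..., 'filename': ...} entry into destination_directory."""
    os.makedirs(destination_directory, exist_ok=True)
    for entry in tqdm(datafiles, desc='Downloading raw data'):
        response = requests.get(entry['url'], timeout=60)
        response.raise_for_status()  # fail loudly on a broken link
        with open(os.path.join(destination_directory, entry['filename']), 'wb') as f:
            f.write(response.content)
```

If the real module differs, only this sketch changes; the calling cell above stays the same.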
Grain Consumption By Country For 2023

The important metrics include a column identifying the country and a column for the food type being measured (wheat, barley, or rye). There are also two quantities measuring each country's food supply. One is kg per capita: a country's supply of a specific food type divided by its population. The other is food supply in kilocalories per capita per day: the total kcal of that food type available, divided by the population. Both measures cover 2023 only. Food supply refers to the amount of food available for consumption; it is not the same as how much food was actually consumed, since it does not account for food waste. This data is mostly reliable, since it comes from the UN Food and Agriculture Organization. However, it is compiled from both official and unofficial data, which could be inaccurate; official data provided by countries could be biased, since countries might be incentivized to overstate food production and consumption to seem more prosperous than they actually are. Also, while the dataset covers most countries in the world, some countries are missing barley and especially rye supply values, so countries without data for all three food types might have to be removed or have their values replaced (a quick check of how widespread this is follows the wrangling cell below).

```python
import pandas as pd

pd.set_option('display.max_rows', 10)

raw_grain_df = pd.read_csv('data/00-raw/raw-grain-by-country.csv')

# Drop unnecessary columns
raw_grain_df = raw_grain_df[['Area', 'Element', 'Item', 'Value']]
print('Raw data:')
display(raw_grain_df)

# Make the data tidy ('Value' isn't a variable on its own -- split the two
# supply measures into their own columns)
kg_grain_df = raw_grain_df[raw_grain_df['Element'] == 'Food supply quantity (kg/capita/yr)'].drop(columns='Element')
kcal_grain_df = raw_grain_df[raw_grain_df['Element'] == 'Food supply (kcal/capita/day)'].drop(columns='Element')
grain_df = pd.merge(kg_grain_df, kcal_grain_df, how='outer', on=['Area', 'Item'])
grain_df = grain_df.rename(columns={'Area': 'Country',
                                    'Item': 'Food Type',
                                    'Value_x': 'Supply (kg/capita)',
                                    'Value_y': 'Daily Supply (kcal/capita/day)'})

# Write to interim data folder
grain_df.to_csv('data/01-interim/tidied-grain-by-country.csv', index=False)
print('\nTidied data:')
display(grain_df)

# Data is 492 rows and 4 columns
print(grain_df.shape)

print('\nFinding missing values:')
# Missing values (NaN most likely represents 0.0 or a small value close to 0.0)
display(grain_df[grain_df.isnull().any(axis=1)])
# Is supply always NaN when daily supply is zero? (No)
display(grain_df[grain_df['Daily Supply (kcal/capita/day)'] == 0.0].sort_values('Supply (kg/capita)'))

print('\nFinding outliers:')
# Looking for outliers (looks like there are none)
display(grain_df.sort_values('Supply (kg/capita)', ascending=False))
display(grain_df.sort_values('Daily Supply (kcal/capita/day)', ascending=False))

# For missingness, daily supply is zero, implying supply is zero for NaN values.
# Checking if supply = 0 means daily supply has to be 0
# (No, implying supply is often rounded down to 0)
print('When supply = 0, does daily supply = 0 always?')
display(grain_df[grain_df['Supply (kg/capita)'] == 0.0].sort_values('Daily Supply (kcal/capita/day)'))

# However, daily supply = 0 usually means supply = 0, so it still makes sense
# to replace NaN values with 0
grain_df = grain_df.fillna(0.0)

# Write final dataframe to data folder
grain_df.to_csv('data/02-processed/grain-by-country.csv', index=False)

# Summary stats for supply and daily supply
grain_df.describe()
```
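Before committing to fillna(0.0), it can help to quantify how widespread the barley/rye missingness noted above actually is. A minimal sketch, run against the interim (pre-fill) file and assuming the column names produced above:

```python
import pandas as pd

# Use the interim file, i.e. before NaNs were replaced with 0.0
grain_df = pd.read_csv('data/01-interim/tidied-grain-by-country.csv')

# How many countries lack a kg-supply value for each food type?
missing_by_type = (grain_df
                   .assign(missing=grain_df['Supply (kg/capita)'].isnull())
                   .groupby('Food Type')['missing']
                   .sum())
print(missing_by_type)

# Countries missing values for more than one food type are the weakest cases
missing_by_country = (grain_df[grain_df['Supply (kg/capita)'].isnull()]
                      .groupby('Country')
                      .size()
                      .sort_values(ascending=False))
print(missing_by_country.head(10))
```

If only a handful of countries are affected and mostly for rye, filling with 0 is defensible; if the missingness is broad, dropping those countries may be safer.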
Celiac Rates From Different Studies

The dataset was transcribed from a meta-analysis and review by Singh et al., published in Clinical Gastroenterology and Hepatology in 2018. The study analyzed both the seroprevalence (the percentage of individuals who tested positive for celiac disease through antibody tests) and the biopsy-confirmed prevalence of celiac disease. Seroprevalence is often higher, as not every patient who tests positive for antibodies goes on to receive biopsy confirmation: biopsy is a more invasive medical procedure and requires further access to care, which may not be available globally. Another important metric is a bias score, which helps measure how biased the data from a particular study might be. As this meta-analysis is a compilation of pre-existing datasets from around the world, it is limited by the data available for analysis. Some noted limitations were small sample sizes, inadequate age- and sex-based data, and the fact that some data comes from potentially unrepresentative populations, such as children or healthy blood donors. Also, some of this data is fairly old, so it might reflect outdated celiac disease rates.

```python
raw_global_df = pd.read_csv('data/00-raw/raw-global-celiac.csv', thousands=',')

# Drop unnecessary columns
raw_global_df = raw_global_df.drop(columns=['Year of study', 'Region', 'Risk of bias',
                                            'Serology used', 'Seropositive'])

# Data is tidy since each row is a unique study
print(f"Number of duplicate studies in same country: "
      f"{len(raw_global_df[raw_global_df.duplicated(subset=['Study', 'Country'])])}")

# Study column no longer needed
global_df = raw_global_df.drop(columns=['Study'])

# Standardize column names
global_df = global_df.rename(columns={'Children/adults/both': 'Age Group',
                                      'Sample size': 'Sample Size',
                                      'Patients with CD': 'Rate'})

# Write to interim data folder
global_df.to_csv('data/01-interim/interim-global-celiac.csv', index=False)
print('\nData at this stage:')
display(global_df)

print('\nFinding missing values:')
# Missing values (seems to be the totals row)
display(global_df[global_df.isnull().any(axis=1)])
# We don't need the totals row, so remove it
global_df = global_df.dropna()

print('\nFinding outliers:')
# Looking for outliers (looks like there are none)
display(global_df.sort_values('Sample Size', ascending=False))
display(global_df['Rate'].unique())

# Data is 57 rows and 7 columns
print(f'\nShape: {global_df.shape}')

# dtype for the year column seems wrong
print(global_df.dtypes)
global_df['Year'] = global_df['Year'].astype('int')

# 'Rate' starts as a string whose first token is the patient count;
# keep the count, then convert it to a percentage of the sample
global_df['Rate'] = global_df['Rate'].str.split(expand=True)[0].astype('float')
global_df['Rate'] = (global_df['Rate'] * 100) / global_df['Sample Size']
display(global_df)

# Write final dataframe to data folder
global_df.to_csv('data/02-processed/global-celiac.csv', index=False)

# Summary stats for global_df
global_df.describe()
```
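Because several countries appear in multiple studies, the merge described earlier needs one rate per country. A minimal sketch of one reasonable reduction, a sample-size-weighted mean (the weighting choice is an assumption, not something the source prescribes):

```python
import pandas as pd

# Load the processed meta-analysis data
global_df = pd.read_csv('data/02-processed/global-celiac.csv')

def weighted_rate(group):
    # Sample-size-weighted mean of the celiac rate, so large studies count more
    return (group['Rate'] * group['Sample Size']).sum() / group['Sample Size'].sum()

country_rates = (global_df
                 .groupby('Country')
                 .apply(weighted_rate)
                 .rename('Rate')
                 .reset_index())
print(country_rates.sort_values('Rate', ascending=False).head())
```

A plain mean or a filter on the bias score would be equally defensible; whichever you pick, state it in the EDA text so the reader knows how duplicates were resolved.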
European Celiac Rates

This dataset comes from a study that estimated rates of celiac disease in European countries based on multiple other studies. It is a very simple dataset, with one column for the country and another for the prevalence rate of celiac disease. One concern is that it is unclear exactly how the values were calculated, only that they came from four different cited studies. Two of those studies focus on children, which will bias the results toward celiac rates in children. There is also no measurement of how biased the cited studies are or what sample sizes they used.

```python
euro_df = pd.read_csv('data/00-raw/raw-euro-celiac.csv')

# Rename column
euro_df = euro_df.rename(columns={'Prevalence': 'Rate'})
display(euro_df)

# Data is tidy since each row is a unique country
print(f"Number of duplicate countries: {len(euro_df[euro_df.duplicated(subset=['Country'])])}")

print('\nFinding missing values:')
# Missing values (no missing values)
display(euro_df[euro_df.isnull().any(axis=1)])

# Standardize the rate to be a percent (prevalence is given as '1/N', so percent = 100 / N)
euro_df['Rate'] = 100 / euro_df['Rate'].str.split('/', expand=True)[1].astype('float')

print('\nFinding outliers:')
# Looking for outliers (looks like there are none)
display(euro_df.sort_values('Rate', ascending=False))

# Data is 14 rows and 2 columns
print(euro_df.shape)

# Write final dataframe to data folder
euro_df.to_csv('data/02-processed/euro-celiac.csv', index=False)

# Summary stats for celiac rate
euro_df.describe()
```

Exploratory Data Analysis

Instructions: replace the words in this subsection with whatever words you need to set up and preview the EDA you're going to do. Please explicitly load the fully wrangled data you will use from data/02-processed. This is a good idea rather than forcing people to re-run the data getting and wrangling cells above; sometimes getting and wrangling the data takes far longer than reloading the fixed-up dataset.

Carry out whatever EDA you need for your project in the code cells below. Because every project is different, we can't really give you much of a template at this point, but please make sure you describe the what and the why in text here, as well as providing interpretation of results and context.

Please note that you should consider the use of Python modules in your work. Any code which gets called repeatedly should be modularized. So if you run the same pre-processing, analysis, or visualization on different subsets of the data, turn that into a function or class, put it in a .py file that lives in modules/, and import the module to get your work done. For reference, see get_raw() inside modules/get_data.py.

Section 1 of EDA - please give it a better title than this

Some more words and stuff. Remember, notebooks work best if you interleave the code that generates a result with properly annotated figures and text that puts those results into context.

```python
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION
```

Section 2 of EDA

```python
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION
```
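To make Section 1 concrete, here is a hedged sketch of the country-level merge described under "Combining the datasets," followed by the most direct view of the research question: a scatterplot of wheat supply against celiac rate. Column names follow the processed files above; the 'Wheat' label and the choice to average duplicate studies per country are assumptions you should verify against your data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the fully wrangled data rather than re-running the wrangling cells above
grain_df = pd.read_csv('data/02-processed/grain-by-country.csv')
global_df = pd.read_csv('data/02-processed/global-celiac.csv')
euro_df = pd.read_csv('data/02-processed/euro-celiac.csv')

# One celiac rate per country: average duplicate studies (the weighted mean
# sketched earlier is an alternative), then prefer the global meta-analysis
# over the European dataset where both exist
global_rates = global_df.groupby('Country', as_index=False)['Rate'].mean()
celiac_df = pd.concat([global_rates, euro_df[['Country', 'Rate']]], ignore_index=True)
celiac_df = celiac_df.drop_duplicates(subset='Country', keep='first')

# Merge the celiac rate into the grain data on the Country key, keeping only
# wheat ('Wheat' is an assumed label -- check grain_df['Food Type'].unique())
wheat_df = grain_df[grain_df['Food Type'] == 'Wheat']
merged_df = pd.merge(wheat_df, celiac_df, on='Country', how='inner')

# Scatter of wheat supply vs. celiac rate, with the Pearson correlation in the title
r = merged_df['Supply (kg/capita)'].corr(merged_df['Rate'])
plt.scatter(merged_df['Supply (kg/capita)'], merged_df['Rate'], alpha=0.7)
plt.xlabel('Wheat supply (kg per capita per year)')
plt.ylabel('Celiac prevalence (%)')
plt.title(f'Wheat supply vs. celiac rate (Pearson r = {r:.2f})')
plt.show()
```

Note that an inner merge silently drops countries whose names don't match across sources (differing spellings are common), so compare row counts before and after the merge. A per-food-type scatter grid or a choropleth map of the merged rates would make natural Section 2 material.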