Question: Please answer the following questions in a Python Jupyter notebook. Solve 2-G and 2-H; I guess you need to solve the previous ones first in order to solve them. Import these packages before starting:
import numpy as np
import pandas as pd
import seaborn as sns
import math
from sklearn import preprocessing
from sklearn import datasets
import sklearn
from scipy import stats
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.style.use('ggplot')
np.random.seed(1)
Q2 (70 points) Working with Data
In [2]:X = datasets.load_wine(as_frame=True)
data = pd.DataFrame(X.data, columns=X.feature_names)
data['class'] = pd.Series(X.target)
data = data.drop(list(data.columns[5:-1]),axis=1) #Keep only the first five columns and the class label
print(" classes ",data['class'].unique()) #The different class labels in the data .. We have three class labels, 0, 1, 2
print(" class distribution ",data['class'].value_counts()) #Shows the number of rows for each class
data.info()
data.head()
Q2-A (10 points) Construct a scatter plot between the 'ash' and 'malic_acid' columns
Use plt.scatter to plot these two variables
Use data['class'] to color the points
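A minimal sketch of one way to approach Q2-A. Putting 'ash' on the x-axis and 'malic_acid' on the y-axis, the 'viridis' colormap, and the colorbar are assumptions, since the question does not specify them. The first lines only rebuild the five-feature frame from the setup cell so the snippet runs on its own.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets

# Rebuild the five-feature frame from the setup cell
X = datasets.load_wine(as_frame=True)
data = pd.DataFrame(X.data, columns=X.feature_names)
data['class'] = pd.Series(X.target)
data = data.drop(list(data.columns[5:-1]), axis=1)

# One point per row; the integer class label (0, 1, 2) selects the color
points = plt.scatter(data['ash'], data['malic_acid'],
                     c=data['class'], cmap='viridis')  # cmap is an assumption
plt.xlabel('ash')
plt.ylabel('malic_acid')
plt.colorbar(points, label='class')
```

In a notebook with %matplotlib inline the figure should render when the cell finishes; in a plain script a call to plt.show() would be needed.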
Q2-B (10 points) Construct a scatter matrix between all the attributes, except the last attribute, Class
Use the sns.pairplot function to plot the scatter matrix .. use the 'class' attribute as the hue
Q2-C (5 points) Normalize the data such that each attribute has a minimum of 0 and a maximum of 1
Don't change the content of the original dataframe. The final result will be stored in data_scaled
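One way Q2-C might be done is with preprocessing.MinMaxScaler, sketched below. Leaving the 'class' label unscaled is an assumption (the question says "each attribute" without saying whether the label counts), and feature_cols is a hypothetical name.

```python
import pandas as pd
from sklearn import datasets, preprocessing

# Rebuild the five-feature frame from the setup cell
X = datasets.load_wine(as_frame=True)
data = pd.DataFrame(X.data, columns=X.feature_names)
data['class'] = pd.Series(X.target)
data = data.drop(list(data.columns[5:-1]), axis=1)

feature_cols = list(data.columns[:-1])  # hypothetical name; excludes 'class'

# Work on a copy so the original dataframe is left untouched
data_scaled = data.copy()
scaler = preprocessing.MinMaxScaler()  # default feature_range is (0, 1)
data_scaled[feature_cols] = scaler.fit_transform(data[feature_cols])

print(data_scaled[feature_cols].agg(['min', 'max']))
```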
Q2-D (5 points) Standardize the data such that each attribute has a mean of 0 and a standard deviation of 1 (unit variance)
Hint: use preprocessing.StandardScaler
Don't change the content of the original dataframe. The final result will be stored in data_scaled
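A sketch of Q2-D following the hint. Again, leaving 'class' unscaled is an assumption and feature_cols is a hypothetical name. Note that StandardScaler divides by the population standard deviation (ddof=0), so the default pandas std(), which uses ddof=1, will report values slightly above 1.

```python
import pandas as pd
from sklearn import datasets, preprocessing

# Rebuild the five-feature frame from the setup cell
X = datasets.load_wine(as_frame=True)
data = pd.DataFrame(X.data, columns=X.feature_names)
data['class'] = pd.Series(X.target)
data = data.drop(list(data.columns[5:-1]), axis=1)

feature_cols = list(data.columns[:-1])  # hypothetical name; excludes 'class'

# Work on a copy so the original dataframe is left untouched
data_scaled = data.copy()
scaler = preprocessing.StandardScaler()
data_scaled[feature_cols] = scaler.fit_transform(data[feature_cols])

# Check with ddof=0 to match the scaler's definition of standard deviation
print(data_scaled[feature_cols].mean())
print(data_scaled[feature_cols].std(ddof=0))
```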
Q2-E Equal-Width Binning (5 points)
Convert the values in each attribute to discrete values and use 5 bins.
Use the pandas cut method, pd.cut
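A minimal sketch for Q2-E. The result name data_equal_width, the choice to return integer bin codes (labels=False) rather than interval labels, and skipping the 'class' column are all assumptions.

```python
import pandas as pd
from sklearn import datasets

# Rebuild the five-feature frame from the setup cell
X = datasets.load_wine(as_frame=True)
data = pd.DataFrame(X.data, columns=X.feature_names)
data['class'] = pd.Series(X.target)
data = data.drop(list(data.columns[5:-1]), axis=1)

feature_cols = list(data.columns[:-1])  # hypothetical name; excludes 'class'

data_equal_width = data.copy()  # hypothetical result name
for col in feature_cols:
    # bins=5 splits the column's [min, max] range into 5 equally wide intervals;
    # labels=False returns the bin index (0..4) instead of an Interval
    data_equal_width[col] = pd.cut(data[col], bins=5, labels=False)

print(data_equal_width.head())
```

With equal-width bins the number of rows per bin can be very uneven, which value_counts() on any binned column would show.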
Q2-F Equal Frequency Binning (5 points)
Convert the values in each attribute to discrete values and use 5 bins.
Use the pandas qcut method, pd.qcut
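A sketch for Q2-F along the same lines. The result name data_equal_freq and the integer bin codes are assumptions; duplicates='drop' is added only as a safeguard in case tied values produce repeated quantile edges.

```python
import pandas as pd
from sklearn import datasets

# Rebuild the five-feature frame from the setup cell
X = datasets.load_wine(as_frame=True)
data = pd.DataFrame(X.data, columns=X.feature_names)
data['class'] = pd.Series(X.target)
data = data.drop(list(data.columns[5:-1]), axis=1)

feature_cols = list(data.columns[:-1])  # hypothetical name; excludes 'class'

data_equal_freq = data.copy()  # hypothetical result name
for col in feature_cols:
    # q=5 places the bin edges at the quintiles, so each bin holds
    # roughly the same number of rows
    data_equal_freq[col] = pd.qcut(data[col], q=5, labels=False,
                                   duplicates='drop')

print(data_equal_freq['ash'].value_counts().sort_index())
```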
Q2-G Sampling (15 points)
Construct three samples from the original dataset
data_sample1: select 30 random rows
data_sample2: select 10 random rows from each class, for a total of 30 rows
data_sample3: select 17% random rows from each class. Hint: use frac=0.17
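The three samples could be drawn as sketched below, assuming pandas 1.1 or later (which added sample on groupby objects). The random_state=1 argument is an assumption made for reproducibility; without it, pandas draws from NumPy's global random state, which the setup cell seeds with np.random.seed(1).

```python
import pandas as pd
from sklearn import datasets

# Rebuild the five-feature frame from the setup cell
X = datasets.load_wine(as_frame=True)
data = pd.DataFrame(X.data, columns=X.feature_names)
data['class'] = pd.Series(X.target)
data = data.drop(list(data.columns[5:-1]), axis=1)

# Simple random sample: 30 rows, ignoring the class label
data_sample1 = data.sample(n=30, random_state=1)

# Stratified sample with a fixed count: 10 rows from each of the 3 classes
data_sample2 = data.groupby('class').sample(n=10, random_state=1)

# Stratified sample with a fixed fraction: 17% of the rows of each class,
# so larger classes contribute more rows
data_sample3 = data.groupby('class').sample(frac=0.17, random_state=1)

print("Sample1 Size ", len(data_sample1))
print("Sample2 Size ", len(data_sample2))
print("Sample3 Size ", len(data_sample3))
print(data_sample3['class'].value_counts())
```

The first sample may over- or under-represent a class by chance, the second forces equal class counts, and the third preserves the class proportions of the full data.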
In [ ]:#data_sample1 Sample 30 rows from the data
#data_sample2 Sample 10 rows from each class
#data_sample3 Sample 17% for each class
#uncomment the following three lines to check your results
#print("Sample1 Size ", len(data_sample1)," ", data_sample1.head(30))
#print(" Sample2 Size ", len(data_sample2)," ", data_sample2.head(30))
#print(" Sample3 Size ", len(data_sample3)," ",data_sample3.head(30))
Q2-H (15 points)
Write Python code to answer the following questions with respect to the wine data set. You can use Pandas DataFrame:
What is the correlation coefficient between 'magnesium' and 'ash' for rows with class label 2?
What is the average of the 'ash' columns for rows with class label 1?
What are the averages for all the columns for rows with class label 0? -- use mean in dataframe
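The three questions above can be answered with boolean filtering on the 'class' column, as in this sketch. The variable names (class2, corr_mg_ash, ash_mean_class1, means_class0) are hypothetical, and Series.corr is assumed to be acceptable since it computes the Pearson coefficient by default.

```python
import pandas as pd
from sklearn import datasets

# Rebuild the five-feature frame from the setup cell
X = datasets.load_wine(as_frame=True)
data = pd.DataFrame(X.data, columns=X.feature_names)
data['class'] = pd.Series(X.target)
data = data.drop(list(data.columns[5:-1]), axis=1)

# 1) Pearson correlation between 'magnesium' and 'ash' for class 2 rows
class2 = data[data['class'] == 2]
corr_mg_ash = class2['magnesium'].corr(class2['ash'])
print("corr(magnesium, ash | class 2) =", corr_mg_ash)

# 2) Average of the 'ash' column for class 1 rows
ash_mean_class1 = data.loc[data['class'] == 1, 'ash'].mean()
print("mean(ash | class 1) =", ash_mean_class1)

# 3) Averages of all columns for class 0 rows (DataFrame.mean is column-wise)
means_class0 = data[data['class'] == 0].mean()
print(means_class0)
```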
