Question: Python, Import the following packages first:
import numpy as np
import pandas as pd
import seaborn as sns
import math
from sklearn import preprocessing
from sklearn import datasets
import sklearn
from scipy import stats
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
matplotlib.style.use('ggplot')
np.random.seed(1)
That's the question
Q2 (20 points) Working with Data
In [ ]:X = datasets.load_wine(as_frame=True)
data = pd.DataFrame(X.data, columns=X.feature_names)
data['class'] = pd.Series(X.target)
data = data.drop(list(data.columns[5:-1]), axis=1) #Keep only the first five columns and the class label
print(" classes ", data['class'].unique()) #The different class labels in the data .. We have three class labels, 0, 1, 2
print(" class distribution ", data['class'].value_counts()) #Shows the number of rows for each class
data.info()
data.head()
Q2-A Construct a scatter plot between the 'ash' and 'malic_acid' columns
Use plt.scatter to plot these two variables
Use data['class'] to color the points
In [ ]:#Type your answer
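One possible answer to Q2-A, as a minimal sketch. It reloads the wine data so that it runs on its own; the `viridis` colormap, the axis labels, and the legend are choices made here and are not required by the assignment.

```python
import matplotlib.pyplot as plt
from sklearn import datasets

# Reload the wine data so the sketch runs on its own
wine = datasets.load_wine(as_frame=True)
data = wine.data.copy()
data['class'] = wine.target

# c= takes one value per point; the class label (0, 1, 2) selects the color
points = plt.scatter(data['ash'], data['malic_acid'], c=data['class'], cmap='viridis')
plt.xlabel('ash')
plt.ylabel('malic_acid')

# Build a legend with one entry per class label
plt.legend(*points.legend_elements(), title='class')
```

In a notebook with %matplotlib inline the figure is rendered at the end of the cell, so no explicit plt.show() call is needed.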
Q2-B (10 points) Construct a scatter matrix between all the attributes except the last one, 'class'
Use the sns.pairplot function to plot the scatter matrix, with the 'class' attribute as the hue
In [ ]:#Type your answer
Q-A Normalize the data such that each attribute has a minimum of 0 and a maximum of 1
Don't change the content of the original dataframe. The final result will be stored in data_scaled
In [ ]:#Normalizing all the columns .. Accessing the columns with the columns' names
from sklearn.preprocessing import MinMaxScaler
data_scaled = data.copy() #Work on a copy so the original dataframe is unchanged
features = data.columns[:-1] #Every column except the 'class' label
scaler = MinMaxScaler() #Rescales each column to the range [0, 1]
data_scaled[features] = scaler.fit_transform(data[features])
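As a sanity check on what min-max normalization does, the sketch below compares MinMaxScaler with the formula x' = (x - min) / (max - min) applied column by column. The `toy` dataframe is a hypothetical example introduced here, not part of the assignment data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy frame, used only to illustrate the formula
toy = pd.DataFrame({'a': [1.0, 2.0, 5.0], 'b': [10.0, 30.0, 20.0]})

# Min-max scaling by hand: x' = (x - min) / (max - min), column by column
manual = (toy - toy.min()) / (toy.max() - toy.min())

# The same transformation through scikit-learn
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

print(scaled)
print(np.allclose(manual, scaled))  # both routes should agree
```

After scaling, every column has a minimum of 0 and a maximum of 1, and the original `toy` frame is left untouched.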
Q-B Standardize the data such that each attribute has a mean of 0 and a standard deviation of 1 (unit variance)
Hint: use preprocessing.StandardScaler
Don't change the content of the original dataframe. The final result will be stored in data_scaled
In [ ]:#Standardizing all the columns .. Accessing the columns with the columns' names
from sklearn.preprocessing import StandardScaler
data_scaled = data.copy() #Work on a copy so the original dataframe is unchanged
features = data.columns[:-1] #Every column except the 'class' label
scaler = StandardScaler() #Centers each column at 0 and scales it to unit variance
data_scaled[features] = scaler.fit_transform(data[features])
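One detail worth checking after standardizing: StandardScaler divides by the population standard deviation (ddof=0), while the pandas .std() method defaults to the sample standard deviation (ddof=1). The sketch below illustrates the difference on a hypothetical `toy` dataframe that is not part of the assignment data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame, used only to illustrate the check
toy = pd.DataFrame({'a': [1.0, 2.0, 5.0], 'b': [10.0, 30.0, 20.0]})
scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

# Standardizing by hand with the population standard deviation (ddof=0)
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(np.allclose(manual, scaled))  # both routes should agree
print(scaled.mean())                # close to 0 for every column
print(scaled.std(ddof=0))           # 1 for every column
print(scaled.std())                 # pandas default (ddof=1) is slightly larger than 1
```

So if data_scaled.std() reports values slightly above 1, that reflects the ddof convention and does not mean the scaling went wrong.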
Discretization
Q-C Equal-Width Binning
Convert the values in each attribute to discrete values and use 5 bins.
Use the pandas cut method, pd.cut
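In [ ]:data_discrete = data.copy() #copy() must be called; data.copy alone is only a reference to the method
for column in data_discrete.columns[:-1]: #Skip the 'class' label
    #labels=False returns the bin index (0 to 4) instead of an interval
    data_discrete[column] = pd.cut(data_discrete[column], bins=5, labels=False)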
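To see what equal-width binning does, the sketch below runs pd.cut on a small hypothetical series and uses retbins=True to get the bin edges back. The values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one large outlier
values = pd.Series([0.0, 1.0, 2.0, 3.0, 10.0])

# retbins=True also returns the edges that pd.cut computed
codes, edges = pd.cut(values, bins=5, labels=False, retbins=True)

print(codes.tolist())   # [0, 0, 0, 1, 4]
print(np.diff(edges))   # widths of the five bins
```

The range 0 to 10 is split into five intervals of width 2, and pd.cut lowers the first edge slightly so that the minimum value falls inside the first bin. Because the widths are fixed, the outlier leaves bins 2 and 3 empty.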
Q-D Equal Frequency Binning
Convert the values in each attribute to discrete values and use 5 bins.
Use the pandas qcut method, pd.qcut
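In [ ]:data_freq = data.copy()
for column in data_freq.columns[:-1]: #Skip the 'class' label
    #q=5 splits each column at its quintiles, so every bin gets about the same number of rows
    data_freq[column] = pd.qcut(data_freq[column], q=5, labels=False)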
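The difference between the two binning strategies shows up most clearly on skewed data. The sketch below counts how many values land in each bin for pd.cut and pd.qcut; the exponential sample and the seed are assumptions made here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed sample of 100 values
rng = np.random.default_rng(0)
values = pd.Series(rng.exponential(size=100))

# Equal width: same interval length, uneven counts
width_counts = pd.cut(values, bins=5, labels=False).value_counts().sort_index()

# Equal frequency: interval lengths vary, counts are balanced
freq_counts = pd.qcut(values, q=5, labels=False).value_counts().sort_index()

print(width_counts)
print(freq_counts)
```

With 100 distinct values, pd.qcut places 20 values in each of the five bins, while pd.cut concentrates most of them in the lowest bins.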
