Question: How to do label encoding and one-hot encoding in Linux (Bash shell script)?
The dataset is a set of features arranged as columns that are separated by a delimiter. In this project, the delimiter is the semicolon (;), as shown in figure 1. The first row of the dataset is the header, which gives the names of the features. The features in the dataset can be of two types: (1) numeric features of type integer, such as age, height, and weight, and (2) categorical features, such as gender, active, smoke, and governorate.

Figure 1: A snapshot of a dataset in a text file.

Dataset is Clean

In this project, we will assume that the dataset has been cleaned. That is, all rows in the dataset contain values for all features with the correct data type, and there are no missing values.

Encode Features

For categorical features, there are two types of encoding: label encoding and one-hot encoding. The description of each of these encodings is as follows:

1) Label encoding replaces categorical data with integer codes. As an example, the label encoding of the governorate feature in the dataset of figure 1 will replace ramallah with 0, nablus with 1, and jerusalem with 2, as shown in figure 2.

Figure 2: Label encoding of the governorate feature.

2) One-hot encoding splits the categorical feature column into multiple columns, one per distinct value, and each sample gets a 1 in the column that matches its value and a 0 in the others. As an example, the governorate feature in the dataset of figure 1 is replaced with three feature columns: ramallah, nablus, and jerusalem. The code of these new features for the sample with id=1, who lives in Ramallah, is 1;0;0, as shown in figure 3.
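One possible way to do both encodings from a Bash script is to let awk handle the column work. The sketch below is only an illustration under stated assumptions: the function names label_encode and one_hot_encode are hypothetical, the file has a header row and uses ; as the delimiter, the dataset is clean as described above, and integer codes and new columns are assigned in order of first appearance in the file (the question does not say how the order is chosen).

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, not part of the assignment.
# Assumes a clean, semicolon-delimited file whose first row is the header.

# label_encode FILE FEATURE
# Prints FILE with each distinct value of FEATURE replaced by an integer
# code, assigned in order of first appearance (assumption): 0, 1, 2, ...
label_encode() {
    awk -F';' -v OFS=';' -v feat="$2" '
        BEGIN { n = 0 }
        NR == 1 {
            # Locate the feature column by its header name.
            for (i = 1; i <= NF; i++) if ($i == feat) col = i
            if (!col) { print "feature not found: " feat > "/dev/stderr"; exit 1 }
            print
            next
        }
        {
            # Give a new code to a value the first time it is seen.
            if (!($col in code)) code[$col] = n++
            $col = code[$col]
            print
        }
    ' "$1"
}

# one_hot_encode FILE FEATURE
# Prints FILE with the FEATURE column replaced by one 0/1 column per
# distinct value. The file is read twice: once to collect the values,
# once to write the encoded rows.
one_hot_encode() {
    awk -F';' -v OFS=';' -v feat="$2" '
        # Pass 1: find the column and collect distinct values in order.
        NR == FNR {
            if (FNR == 1) {
                for (i = 1; i <= NF; i++) if ($i == feat) col = i
                if (!col) { print "feature not found: " feat > "/dev/stderr"; exit 1 }
            } else if (!($col in seen)) {
                seen[$col] = 1
                cats[++n] = $col
            }
            next
        }
        # Pass 2: rebuild each row, expanding the feature column in place.
        {
            line = ""
            for (i = 1; i <= NF; i++) {
                if (i == col) {
                    cell = ""
                    for (k = 1; k <= n; k++) {
                        # Header row gets the value names, data rows get 0 or 1.
                        v = (FNR == 1) ? cats[k] : ((("" $col) == cats[k]) ? 1 : 0)
                        cell = cell (k > 1 ? OFS : "") v
                    }
                } else {
                    cell = $i
                }
                line = line (i > 1 ? OFS : "") cell
            }
            print line
        }
    ' "$1" "$1"
}
```

With a hypothetical file named dataset.txt, `label_encode dataset.txt governorate > labeled.txt` would write the label-encoded copy and `one_hot_encode dataset.txt governorate > onehot.txt` the one-hot copy. The one-hot function reads the file twice because the full list of distinct values has to be known before the new header row can be printed.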
