Question:
Step 1 (Exploratory Data Analysis)
In this task, you will predict the species of a flower from its characteristics. In particular, you will cluster Iris flowers to determine which of three Iris species each one belongs to: Setosa, Versicolour, or Virginica. To do this, you need a data set containing the characteristics of flowers from these three species. One such data set is the well-known Iris data set from the University of California, Irvine (UCI) Machine Learning Repository:
https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
It contains information on 150 Iris flowers, 50 from each of the three Iris species: Setosa, Versicolour, and Virginica. Each flower is characterized by four attributes:
- sepal length in centimetres
- sepal width in centimetres
- petal length in centimetres
- petal width in centimetres
In this step, as demonstrated in the case study, you will use Python code to:
- retrieve the data from the above repository and clean it: check for any NaN (Not a Number) or NA values, remove them if present, and convert the data to an appropriate format so that your dataset contains only the required attributes or labels, i.e., sepal length, sepal width, petal length, and petal width. Using the head() method, your data should look like:
You must show the first 10 rows.
- extract meaningful statistics. These include summary statistics of your data: the number of records (count), mean, standard deviation, minimum, and maximum for each label or attribute. Then obtain the count, mean, standard deviation, minimum, and maximum for each species.
- provide appropriate visualisations of your dataset that show statistical significance and clusters in your data. You must use at least pairplot() and scatterplot(). Your scatter plot should show the petal width attribute against the petal length attribute.
- conclude from the above Exploratory Data Analysis that your dataset contains 3 clusters, and show them.
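The Step 1 tasks above could be approached along the following lines. This is only a minimal sketch, not the case study's code: the column names in `COLUMNS` and the helper names `load_iris`, `summarize`, and `plot_eda` are assumptions introduced here (the data file itself has no header row), and the species column is kept because the per-species statistics need it.

```python
import pandas as pd

URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
# Assumed column names: the file has no header, only values in this order.
COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
STATS = ["count", "mean", "std", "min", "max"]


def load_iris(source=URL):
    """Read the header-less CSV, report missing values, and drop them."""
    df = pd.read_csv(source, header=None, names=COLUMNS)
    print("Missing values per column:")
    print(df.isna().sum())
    return df.dropna().reset_index(drop=True)


def summarize(df):
    """Return (overall, per_species) tables of count/mean/std/min/max."""
    overall = df.drop(columns="species").agg(STATS)
    per_species = df.groupby("species").agg(STATS)
    return overall, per_species


def plot_eda(df):
    """Draw the pair plot and the petal width vs petal length scatter plot."""
    # Imported here so the statistics above run without plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.pairplot(df, hue="species")
    plt.figure()
    sns.scatterplot(data=df, x="petal_length", y="petal_width", hue="species")
    plt.show()
```

A typical session would call `df = load_iris()`, then `print(df.head(10))`, then `overall, per_species = summarize(df)`, and finally `plot_eda(df)`. Colouring both plots by species is one way to make the three groups visible.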
Step 2 (Plotting a Single-Linkage Dendrogram)
In this step, you limit your dataset to the first 6 rows of sepal width and sepal length.
You are going to develop Python code that uses the Agglomerative Hierarchical Clustering algorithm with Euclidean distance and single linkage (merging the two clusters with the minimum distance between them) to cluster the above 6 data records and plot them as a dendrogram.
Here you are required to look at your data and try to understand it. You can treat sepal width as the 'x' coordinate and sepal length as the 'y' coordinate. To start, you can use scatterplot() to draw each point. Then use your lecture on Hierarchical Clustering Algorithms to plot a dendrogram, using single linkage to merge the two closest clusters. You must write a step-by-step algorithm for your specific Agglomerative Hierarchical Clustering, with its associated flow chart. For each epoch, your code must show:
- Your calculations
- Your clusters based on the updated distance matrix.
- Your dendrogram progress and the new clusters in your dendrogram plot.
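One possible shape for the Step 2 code is sketched below, using only NumPy. It assumes the first six records of iris.data have the sepal values listed in `POINTS` (check them against your own download); the function name `single_linkage` and the layout of its return value are choices made here, not something prescribed by the task. Each pass prints the pair of clusters being merged, the merge distance, and the remaining clusters; the flow chart and the dendrogram drawing are left out.

```python
import numpy as np

# (sepal width, sepal length) of the first six records, used as (x, y).
# Assumed values; verify against your copy of iris.data.
POINTS = np.array([
    [3.5, 5.1], [3.0, 4.9], [3.2, 4.7],
    [3.1, 4.6], [3.6, 5.0], [3.9, 5.4],
])


def single_linkage(points):
    """Agglomerative clustering with Euclidean distance and single linkage.

    Returns one tuple (id_a, id_b, distance, size) per merge. Original
    records have ids 0..n-1; each merged cluster gets the next free id.
    """
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))  # pairwise Euclidean distances

    clusters = {i: [i] for i in range(n)}  # cluster id -> member records
    d = {(i, j): dist[i, j] for i in range(n) for j in range(i + 1, n)}
    merges, next_id = [], n

    while len(clusters) > 1:
        a, b = min(d, key=d.get)  # closest pair of clusters
        height = d[(a, b)]
        members = clusters.pop(a) + clusters.pop(b)

        # Keep distances that do not involve the two merged clusters.
        new_d = {k: v for k, v in d.items() if a not in k and b not in k}
        # Single linkage: distance to the new cluster is the smaller of
        # the distances to its two parts.
        for c in clusters:
            to_a = d[(min(a, c), max(a, c))]
            to_b = d[(min(b, c), max(b, c))]
            new_d[(c, next_id)] = min(to_a, to_b)

        clusters[next_id] = members
        merges.append((a, b, height, len(members)))
        print(f"Step {len(merges)}: merge {a} and {b} at {height:.4f}")
        print(f"  clusters now: {sorted(clusters.values())}")
        d, next_id = new_d, next_id + 1

    return merges


merges = single_linkage(POINTS)
```

The merge tuples follow the row layout of a SciPy linkage matrix, so `scipy.cluster.hierarchy.dendrogram(np.array(merges, dtype=float))` should be able to draw the tree if SciPy and matplotlib are available. Note that some pairwise distances in these six records are equal, so the order in which tied pairs merge can depend on floating-point rounding.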
Step 3
Answer the following questions:
How do you interpret your dendrogram? How many species does it show?
