implement the agglomerative hierarchical clustering algorithm. We will implement three different cluster similarity measures: Single link,...
Fantastic news! We've Found the answer you've been seeking!
Question:
Transcribed Image Text:
implement the agglomerative hierarchical clustering algorithm. We will implement three different cluster similarity measures: Single link, Complete link, and Average link. We will be using geographical data for this assignment. Each of the data points is a 2D vector, with longitude and latitude as its dimensions. Here is a brief review of agglomerative clustering. During the clustering process, we iteratively aggregate the most similar two clusters until there are K clusters left. For initialization, each data point forms its own cluster. The similarity of two clusters C₁, C'; is determined by a distance measure. We use the following three measures in this assignment: Single link: D(C₁, C;) = min{d(vp, va) | Vp Ci, va € Cj} Complete link: D(C₁, C₂) = max{d(vp, Vg) | Vp € C₁, Vq € Cj} Average link: D(C₁, C;) = mean{d(vp, Vq) | Vp € C₁, Vq € Cj } The smaller the distance is, the more similar the two clusters are. In the equations, d(...) is a distance measure between two data points. In this assignment, for simplicity, we use the Euclidean distance, defined by: In the equations, d(.,.) is a distance measure between two data points. In this assignment, for simplicity, we use the Euclidean distance, defined by: d(p, q) = √i (Pi — qi ) ² where pi, qi are dimensions of p, q. Your task: complete the missing functions in the given programming template files. Programming Assignment: Hierarchical Clustering Programming Template Find in template/ You are allowed to use one of three programming languages for this assignment: Python (3.10), C++ (14), or Java (JDK version 17). You are provided with the template code for each language containing descriptions for the expected input and output for each function that you would have to write. Sample Input and Output Find in sample_test_cases/ To aid your understanding, we have included one test case for each of Single Link, 1 Sample Input and Output Find in sample_test_cases/ To aid your understanding, we have included one test case for each of Single Link, Complete Link, and Average Link clustering within this problem statement. The input format is as follows: The first line of the input contains three space-separated integers N, K, M: 1. N: The number of data points (lines) following the first line. 2. K: The number of output clusters. 3. M: The cluster similarity measure. M = 0 for single link, M = 1 for complete link, M = 2 for average link. Starting from the second line, each line will have exactly two floating point numbers, representing the longitude and latitude of a location. Each line corresponds to a 2D data point which will be fed into the clustering algorithm. Thus, in the input file, there would be N + 1 lines in total. The objective would be to use the similarity measure specified by M to cluster the N data points into K clusters. Note here that the test case format is provided for you to understand the test case files that will be used. For whichever language you use, you would NOT have to read in test cases from STDIN. That will be handled by the grader. You simply need to complete the required functions in the template code file. Test Scale For all public and hidden test cases, N≤ 100, K ≤ 40. Allowed Libraries For each language, only standard built-in libraries would be usable. For example, if using Python, you would NOT be able to use libraries like NumPy or Scikit-Learn. Submission You need to submit ONE completed template code file to Gradescope. submission.py or submission.cpp Or Submission.java. Note the capitalization for the java file. On Gradescope, your code will be tested on a list of public test cases, for which, you will be able to view both the input and the expected output. At the end of the submission deadline, you will be able to see the results produced by your code on a list of hidden test cases. implement the agglomerative hierarchical clustering algorithm. We will implement three different cluster similarity measures: Single link, Complete link, and Average link. We will be using geographical data for this assignment. Each of the data points is a 2D vector, with longitude and latitude as its dimensions. Here is a brief review of agglomerative clustering. During the clustering process, we iteratively aggregate the most similar two clusters until there are K clusters left. For initialization, each data point forms its own cluster. The similarity of two clusters C₁, C'; is determined by a distance measure. We use the following three measures in this assignment: Single link: D(C₁, C;) = min{d(vp, va) | Vp Ci, va € Cj} Complete link: D(C₁, C₂) = max{d(vp, Vg) | Vp € C₁, Vq € Cj} Average link: D(C₁, C;) = mean{d(vp, Vq) | Vp € C₁, Vq € Cj } The smaller the distance is, the more similar the two clusters are. In the equations, d(...) is a distance measure between two data points. In this assignment, for simplicity, we use the Euclidean distance, defined by: In the equations, d(.,.) is a distance measure between two data points. In this assignment, for simplicity, we use the Euclidean distance, defined by: d(p, q) = √i (Pi — qi ) ² where pi, qi are dimensions of p, q. Your task: complete the missing functions in the given programming template files. Programming Assignment: Hierarchical Clustering Programming Template Find in template/ You are allowed to use one of three programming languages for this assignment: Python (3.10), C++ (14), or Java (JDK version 17). You are provided with the template code for each language containing descriptions for the expected input and output for each function that you would have to write. Sample Input and Output Find in sample_test_cases/ To aid your understanding, we have included one test case for each of Single Link, 1 Sample Input and Output Find in sample_test_cases/ To aid your understanding, we have included one test case for each of Single Link, Complete Link, and Average Link clustering within this problem statement. The input format is as follows: The first line of the input contains three space-separated integers N, K, M: 1. N: The number of data points (lines) following the first line. 2. K: The number of output clusters. 3. M: The cluster similarity measure. M = 0 for single link, M = 1 for complete link, M = 2 for average link. Starting from the second line, each line will have exactly two floating point numbers, representing the longitude and latitude of a location. Each line corresponds to a 2D data point which will be fed into the clustering algorithm. Thus, in the input file, there would be N + 1 lines in total. The objective would be to use the similarity measure specified by M to cluster the N data points into K clusters. Note here that the test case format is provided for you to understand the test case files that will be used. For whichever language you use, you would NOT have to read in test cases from STDIN. That will be handled by the grader. You simply need to complete the required functions in the template code file. Test Scale For all public and hidden test cases, N≤ 100, K ≤ 40. Allowed Libraries For each language, only standard built-in libraries would be usable. For example, if using Python, you would NOT be able to use libraries like NumPy or Scikit-Learn. Submission You need to submit ONE completed template code file to Gradescope. submission.py or submission.cpp Or Submission.java. Note the capitalization for the java file. On Gradescope, your code will be tested on a list of public test cases, for which, you will be able to view both the input and the expected output. At the end of the submission deadline, you will be able to see the results produced by your code on a list of hidden test cases.
Expert Answer:
Answer rating: 100% (QA)
To implement the Agglomerative Hierarchical Clustering algorithm with three different cluster similarity measures Single link Complete link and Averag... View the full answer
Related Book For
Business Analytics Communicating With Numbers
ISBN: 9781260785005
1st Edition
Authors: Sanjiv Jaggia, Alison Kelly, Kevin Lertwachara, Leida Chen
Posted Date:
Students also viewed these programming questions
-
What characteristics would you expect to see in a reader who should be at the emergent and beginning stages of reading?
-
Planning is one of the most important management functions in any business. A front office managers first step in planning should involve determine the departments goals. Planning also includes...
-
Managing Scope Changes Case Study Scope changes on a project can occur regardless of how well the project is planned or executed. Scope changes can be the result of something that was omitted during...
-
Some observers maintain that not all politicians move toward the middle of the political spectrum in order to obtain votes. They often cite Barry Goldwater in the 1964 presidential election and...
-
The lever BCD is hinged at C and is attached to a control rod at B. If 200 P = N, determine (a) The tension in rod AB, (b) The reaction at C. 75 mm 90 30 mm 40 mm
-
a. When isolated chloroplasts are placed in buffer solution with a blue dye such as DCPIP or methylene blue and illuminated, the blue colour disappears. Explain this observation. b. Name the...
-
For each of the following situations, calculate the standard error of the mean \(s \mathrm{X}^{-}\). a. \(s=7.00 ; N=49\) b. \(s=2.50 ; N=14\) c. \(s=8.90 ; N=23\) d. \(s=25.61 ; N=54\)
-
Arnold Industries has pretax accounting income of $33 million for the year ended December 31, 2011. The tax rate is 40%. The only difference between accounting income and taxable income relates to an...
-
The sheet Inventory lists a grocery store's inventory for two months. Calculate the percent change for each item to two decimal places. If the formula gives an error, put "Initial Stock" in the cell....
-
Taiwan is a major world supplier of semiconductor chips. A recent earthquake severely damaged the production facilities of Taiwanese chip - producing companies, sharply reducing the amount of chips...
-
A collimated beam of monochromatic X-rays of wavelength 0.162 nm is incident upon a powdered sample of the cubic metal palladium. Peaks in the scattered X-ray pattern are observed at 20 angles of...
-
Discuss the importance of surprise check.
-
What is interim audit? Why it is required?
-
What is routine checking? What types of work are included in routine checking? What are the objectives of routine checking? Describe the advantages and disadvantages of routine checking. Discuss the...
-
State the controls that can be applied over inputs and processing of data in a computerised accounting environment.
-
Distinguish between principles of auditing and techniques of auditing.
-
Assuming there are no transaction costs or broker fees in the market and [20] on the 1st of January 2023 the Emirates stock is trading at $225 a share when: o There is a $20 annual dividend yield...
-
Use this circle graph to answer following Exercises. 1. What fraction of areas maintained by the National Park Service are designated as National Recreation Areas? 2. What fraction of areas...
-
The accompanying data set contains three predictor variables (x 1 , x 2 , and x 3 ) and the target variable (y). Partition the data to develop a nave Bayes classification model where 1 denotes the...
-
A manufacturing manager uses a dexterity test on 120 current employees in order to predict watch production based on time to completion (in seconds) and a Male dummy variable. A portion of the data...
-
A financial analyst wants to compare the performance of the stocks of two Internet companies, Amazon (AMZN) and Google (GOOG). She records the average closing prices of the two stocks for the years...
-
Coherent light of wavelength \(\lambda\) is normally incident on two slits separated by a distance \(d\). What is the greatest possible fringe order?
-
Coherent green light of wavelength \(530 \mathrm{~nm}\) passes through two very narrow slits separated by \(1.00 \mu \mathrm{m}\). What is \((a)\) the angular location of the first-order bright...
-
Figure 34.21 shows diffracted \(\mathrm{x}\)-ray intensity as a function of the Bragg angle \(\alpha\), obtained using \(\mathrm{x}\) rays having a wavelength of \(0.11 \mathrm{~nm}\). (a) Without...
Study smarter with the SolutionInn App