The goal of this assignment is to try to figure out from which of the 4...

Question:

The goal of this assignment is to figure out from which of the 4 initial COVID-19 patients the other 96 are most likely to have contracted the virus. You will read a set of parameters, as defined in the "Input File Format" section below, as well as a number of points from a file, and perform naïve k-means clustering on them. Your final output will include the number of iterations required to achieve convergence and, for each cluster, the final location of its centroid, the number of points in the cluster, and a list of the points contained in that cluster (examples provided in the "Outputs" section). Please note this is just a helpful guide to get you started. Correctness will primarily be based on your outputs (defined below), but your individual implementation may vary.

Requirements

1. You will implement your solution in a file called kmeans.py.
2. Your solution must implement k-means by implementing the algorithm described below and may not leverage external Python libraries to perform the clustering.
3. Since we are processing multiple input files, your solution should include a list containing the two input file names (points1.txt and points2.txt) and iterate over both files within a single execution of your program.

Input File Format

The input files (points1.txt, points2.txt) consist of the maximum number of clustering iterations at the beginning of the file, followed by N, the total number of points (patients) in the file, followed by k, the number of clusters (initially infected patients), followed by the k centroids (initially infected patients), followed by the N-k remaining patients' locations, each on a separate line. The coordinates for all points are integers and are separated by commas. When reading the data in from the 2 files you may find string methods such as strip() and split() useful.

Here is a (truncated) overview of the file contents for points1.txt. The comments are provided here for clarity but are not included in the actual input files.

    50      # max. number of iterations
    100     # number of patients in input file (COVID-19 patient locations; 4 initially infected (initial cluster centroids), 96 who contracted the virus from those 4, for a total of 100)
    4       # total number of clusters (k, the number of initially infected patients)
    30,45   # cluster centroid 1 (the first of the 4 initially infected patients)
    55,82   # cluster centroid 2 (the second of the 4 initially infected patients)
    61      # cluster centroid 3 (the third of the 4 initially infected patients)
    96,14   # cluster centroid 4 (the fourth of the 4 initially infected patients)
    83,13   # the remaining 96 (N = 100 minus k = 4) patients of the form x,y
    81,32

In a file named kmeans.py, please do the following:

1. Begin by opening kmeans.txt for reading the first 3 fields line-by-line. You'll then want to use the values to dynamically build your data structures for things like your clusters and their corresponding centroids in Step 2. A tuple or list will make the most sense, but you'll want to think about the differences between them before choosing. There is a solution involving both, though, so feel free to get started using whichever data structure you think makes the most sense. If at some point you decide to switch from one data structure to the other, it should require only minimal changes to your code.
2. Create any data structures necessary to store your points, clusters, previous cluster sizes, etc.
3. Create a processing loop to do the following:
   a. For each point in your list:
      i. Compute the distance between the current patient and each of the k centroids
      ii. Add the patient to the cluster to which the nearest centroid belongs
   b. For each cluster, check its size from the previous iteration (see step c.iv) against the size of the clusters computed directly above in step (a)
      i. If any of the cluster sizes changed from the previous iteration, increment a variable that counts the number of iterations required to achieve convergence
   c. For each cluster:
      i. Compute the mean of all x values
      ii. Compute the mean of all y values
      iii. Update the cluster's centroid with the results from (i) and (ii)
      iv. Store/update the current size of each cluster in a list

For each input file, points1.txt and points2.txt, please output the following:

1. A list of the 4 initial COVID-19 patients (centroids)
2. The number of iterations required to obtain stability
3. The values for the 4 final centroids
4. For each centroid print:
   a. The centroid ID followed by the final number of patients in the centroid's cluster
   b. A list of the "patients" (points) contained in the cluster
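
The steps in the question can be sketched roughly as follows. This is only one possible reading of the guide, not a reference solution. The function names (parse_points, kmeans, process_files) and the print layout are hypothetical, since the "Outputs" examples are not reproduced here. The sketch also assumes that every coordinate line holds a full x,y pair, that only the N-k remaining patients are assigned to clusters, and that an empty cluster keeps its previous centroid.

```python
def parse_points(lines):
    """Read max iterations, N, k, then k centroids and N-k points."""
    rows = [line.strip() for line in lines if line.strip()]
    max_iter, n, k = int(rows[0]), int(rows[1]), int(rows[2])
    # Assumption: every coordinate row has the form "x,y" with integer values.
    coords = [tuple(int(v) for v in row.split(",")) for row in rows[3:3 + n]]
    return max_iter, k, coords[:k], coords[k:]


def kmeans(points, initial_centroids, max_iter):
    """Naive k-means that stops when cluster sizes stop changing (step 3b)."""
    centroids = list(initial_centroids)
    clusters = [[] for _ in centroids]
    prev_sizes = [0] * len(centroids)
    iterations = 0
    for _ in range(max_iter):
        # Step 3a: assign each point to the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for x, y in points:
            # Squared distance gives the same nearest centroid as true distance.
            dists = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[dists.index(min(dists))].append((x, y))
        # Step 3b: compare sizes with the previous iteration.
        sizes = [len(cluster) for cluster in clusters]
        if sizes == prev_sizes:
            break
        iterations += 1
        # Step 3c: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:  # assumption: an empty cluster keeps its old centroid
                mean_x = sum(px for px, _ in cluster) / len(cluster)
                mean_y = sum(py for _, py in cluster) / len(cluster)
                centroids[i] = (mean_x, mean_y)
        prev_sizes = sizes
    return iterations, centroids, clusters


def process_files(file_names):
    """Run the clustering on each input file and print the results."""
    for name in file_names:
        with open(name) as handle:
            max_iter, k, initial, points = parse_points(handle)
        iterations, centroids, clusters = kmeans(points, initial, max_iter)
        # Hypothetical print layout; the assignment's exact format is not shown.
        print("Initial centroids:", initial)
        print("Iterations:", iterations)
        print("Final centroids:", centroids)
        for i, cluster in enumerate(clusters, start=1):
            print("Centroid", i, "has", len(cluster), "patients:", cluster)
```

Calling process_files(["points1.txt", "points2.txt"]) would then cover both input files in a single run, as requirement 3 asks. Comparing cluster sizes rather than centroid positions follows step 3b of the guide; it is a weaker convergence test than checking whether any point changed cluster.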

Posted Date: Jul 25, 2023 02:42 AM

Expert Answer:

Here's a Python implementation of the k-means algorithm based on the provided requirements: def euclidean_distance(point1, point2): Calculate the Eucli...
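
The answer text is cut off after its first helper, so the following is only a sketch of what a Euclidean distance helper for step 3a might look like. The body is an assumption, and nearest_centroid_index is a hypothetical companion function that does not appear in the answer.

```python
import math


def euclidean_distance(point1, point2):
    """Straight-line distance between two (x, y) points."""
    dx = point1[0] - point2[0]
    dy = point1[1] - point2[1]
    return math.sqrt(dx * dx + dy * dy)


def nearest_centroid_index(point, centroids):
    """Index of the centroid closest to point (hypothetical helper)."""
    distances = [euclidean_distance(point, c) for c in centroids]
    # On a tie, index() returns the first (lowest-numbered) centroid.
    return distances.index(min(distances))
```

For example, with the sample centroids (30,45), (55,82) and (96,14), the patient at (83,13) is closest to (96,14), so nearest_centroid_index would return index 2 for that list.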
