The goal of this assignment is to try to figure out from which of the 4...
Fantastic news! We've Found the answer you've been seeking!
Question:
Transcribed Image Text:
The goal of this assignment is to try to figure out from which of the 4 initial COVID-19 patients the other 96 are most like to have contracted the virus. You will read a set of parameters as defined in the "Input File Format" section above as well as a number of points from a file and perform naïve k-means clustering on them. Your final output will include the number of iterations required to achieve convergence and, for each cluster, the final location of its centroid, the number of points in the cluster, and a list of points contained in that cluster (examples provided in the "Outputs" section). Please note this is just a helpful guide to get you started. Correctness will primarily be based on your outputs (defined below), but your individual implementation may vary. Requirements 1. You will implement your solution in a file called kmeans.py 2. Your solution must implement k-means by implementing the algorithm above and may not leverage external Python libraries to perform the clustering 3. Since we are processing multiple input files, your solution should include a list containing the two input file names (points1.txt and points2.txt) and iterate over both files within a single execution of your program. Input File Format The input files (points1.txt, points2.txt) will consist of the maximum number of clustering iterations at the beginning of the input file, followed by N, the total number of points (patients) in the input file, followed by k, the number of clusters (initially infected patients), followed by k centroids (initially infected patients), followed by N-k patients' locations, each on a separate line. The coordinates for all points are integers and are separated by commas. When reading the data in from the 2 files you may find string methods such as strip() and split() useful. Here is a (truncated) overview of the file contents for points1.txt. The comments are provided here for clarity but are not included in the actual input files. 50 # max. number of iterations 100 # number of patients in input file (COVID-19 patient locations; 4 initially infected (initial cluster centroids), 96 who contracted the virus from those 4, for a total of 100) 4 # total number of clusters (k, the number of initially infected patients) 30,45 # cluster centroid 1 (the first of the 4 initially infected patients) 55,82 # cluster centroid 2 (the second of the 4 initially infected patients) 61 # cluster centroid 3 (the third of the 4 initially infected patients) 96,14 # cluster centroid 4 (the fourth of the 4 initially infected patients) 83,13 # the remaining 96 (N = 100 minus k = 4) patients of the form x,y 81,32 In a file named kmeans.py, please do the following: 1. Begin by opening kmeans.txt for reading the first 3 fields line-by-line. You'll then want to use the values to dynamically build your data structures for things like your clusters and their corresponding centroids in Step 2. A tuple or list will make the most sense but you'll want to think about the differences between before choosing. There is a solution involving both, though, so feel free to get started using whichever data structure you think makes the most sense. If at some point you decide you want to switch from one data structure to the other, it should only require minimal changes to your code, so no worries there. 2. Create any data structures necessary to store your points, clusters, previous cluster sizes, etc. 3. Create a processing loop to do the following: a. For each point in your list: i. Compute the distance between the current patient and each of the k centroids ii. Add the patient to the cluster to which the nearest centroid belongs b. For each cluster, check its size from the previous iteration (see Section c.iv) against the size of the clusters computed directly above in step (a) i. If any of the cluster sizes changed from the previous iteration, increment a variable that counts the number of iterations required to achieve convergence For each cluster: C. i. Compute the mean of all x values ii. Compute the mean of all y values iii. Update the cluster's centroid with the results from (i) and (ii) iv. Store/update the current size of each cluster in a list For each input file, points1.txt and points2.txt, please output the following: 1. A list of the 4 initial COVID-19 patients (centroids) 2. The number of iterations required to obtain stability 3. The values for the 4 final centroids 4. For each centroid print: a. The centroid ID followed by the final number of patients in the centroid's cluster A list of the "patients" (points) contained in the cluster The goal of this assignment is to try to figure out from which of the 4 initial COVID-19 patients the other 96 are most like to have contracted the virus. You will read a set of parameters as defined in the "Input File Format" section above as well as a number of points from a file and perform naïve k-means clustering on them. Your final output will include the number of iterations required to achieve convergence and, for each cluster, the final location of its centroid, the number of points in the cluster, and a list of points contained in that cluster (examples provided in the "Outputs" section). Please note this is just a helpful guide to get you started. Correctness will primarily be based on your outputs (defined below), but your individual implementation may vary. Requirements 1. You will implement your solution in a file called kmeans.py 2. Your solution must implement k-means by implementing the algorithm above and may not leverage external Python libraries to perform the clustering 3. Since we are processing multiple input files, your solution should include a list containing the two input file names (points1.txt and points2.txt) and iterate over both files within a single execution of your program. Input File Format The input files (points1.txt, points2.txt) will consist of the maximum number of clustering iterations at the beginning of the input file, followed by N, the total number of points (patients) in the input file, followed by k, the number of clusters (initially infected patients), followed by k centroids (initially infected patients), followed by N-k patients' locations, each on a separate line. The coordinates for all points are integers and are separated by commas. When reading the data in from the 2 files you may find string methods such as strip() and split() useful. Here is a (truncated) overview of the file contents for points1.txt. The comments are provided here for clarity but are not included in the actual input files. 50 # max. number of iterations 100 # number of patients in input file (COVID-19 patient locations; 4 initially infected (initial cluster centroids), 96 who contracted the virus from those 4, for a total of 100) 4 # total number of clusters (k, the number of initially infected patients) 30,45 # cluster centroid 1 (the first of the 4 initially infected patients) 55,82 # cluster centroid 2 (the second of the 4 initially infected patients) 61 # cluster centroid 3 (the third of the 4 initially infected patients) 96,14 # cluster centroid 4 (the fourth of the 4 initially infected patients) 83,13 # the remaining 96 (N = 100 minus k = 4) patients of the form x,y 81,32 In a file named kmeans.py, please do the following: 1. Begin by opening kmeans.txt for reading the first 3 fields line-by-line. You'll then want to use the values to dynamically build your data structures for things like your clusters and their corresponding centroids in Step 2. A tuple or list will make the most sense but you'll want to think about the differences between before choosing. There is a solution involving both, though, so feel free to get started using whichever data structure you think makes the most sense. If at some point you decide you want to switch from one data structure to the other, it should only require minimal changes to your code, so no worries there. 2. Create any data structures necessary to store your points, clusters, previous cluster sizes, etc. 3. Create a processing loop to do the following: a. For each point in your list: i. Compute the distance between the current patient and each of the k centroids ii. Add the patient to the cluster to which the nearest centroid belongs b. For each cluster, check its size from the previous iteration (see Section c.iv) against the size of the clusters computed directly above in step (a) i. If any of the cluster sizes changed from the previous iteration, increment a variable that counts the number of iterations required to achieve convergence For each cluster: C. i. Compute the mean of all x values ii. Compute the mean of all y values iii. Update the cluster's centroid with the results from (i) and (ii) iv. Store/update the current size of each cluster in a list For each input file, points1.txt and points2.txt, please output the following: 1. A list of the 4 initial COVID-19 patients (centroids) 2. The number of iterations required to obtain stability 3. The values for the 4 final centroids 4. For each centroid print: a. The centroid ID followed by the final number of patients in the centroid's cluster A list of the "patients" (points) contained in the cluster
Expert Answer:
Answer rating: 100% (QA)
Heres a Python implementation of the kmeans algorithm based on the provided requirements python def euclideandistancepoint1 point2 Calculate the Eucli... View the full answer
Related Book For
Modern Systems Analysis And Design
ISBN: 9780134204925
8th Edition
Authors: Joseph Valacich, Joey George
Posted Date:
Students also viewed these programming questions
-
The goal of this assignment is to learn about real-world SDLC failures and apply what we've learned about the SDLC to these "cases". I hope that, by doing this, you will not be involved with (or,...
-
Planning is one of the most important management functions in any business. A front office managers first step in planning should involve determine the departments goals. Planning also includes...
-
Ron Stewart is an accounting major at the University of Technology, Jamaica. He is completing a project relating to the Regulatory Framework for Financial Reporting. Ron Stewart has been reading...
-
What are the types of production systems in aquaculture?
-
A diffraction grating of 2000 slits per centimeter is used to analyze the spectrum of mercury. (a) Find the angular separation in the first-order spectrum of the two lines of wavelength 579.0 and...
-
Risks that have been identified and may or may not happen are referred to as known unknowns, and a _______________ should be established to cover them if they are triggered. a. contingency reserve b....
-
True or False: For personal investment decision making, rates of return are used more frequently than present worth.
-
Carver Department Stores, Inc., constructs its own stores. In the past, no cost has been added to the asset value for interest on funds borrowed for construction. Management has decided to correct...
-
Users complain that a database server has slowed down over the last hour despite the fact that no more users have connected Option. You run an antimalware scan, which comes up clean Option. The...
-
Your client, Schroeder Manufacturing Co., provided the following schedule of property, plant, and equipment for the year ended June 30, 2019. Balances have been agreed to the general ledger. As part...
-
How does the poet employ linguistic ambiguity and metaphorical resonance to evoke the existential conflict between the ephemeral nature of existence and the perennial quest for meaning?
-
Earlier this year, the US Travel Policy Council put out a call to the private sector for input on a new USA National Travel and Tourism Strategy to be published later this year. USA National Travel...
-
solve this question. Option A: Life = 6 years, Initial cost = $28,000, Annual operating cost = $2,000 Option B: Life = 3 years, Initial cost = $23,000, Annual operating cost = $3000 MARR = 9% n=3...
-
California's 2600-mile long system of levees east of San Francisco is arguably the most worrisome infrastructure risk in America-called a "ticking time bomb" by some-whose failure would top the...
-
Faraz receives a weekly salary of $ 6 5 0 . 0 0 , a bonus payment of $ 4 0 0 . 0 0 , an auto allowance $ 5 0 . 0 0 and a non - cash taxable benefit of $ 2 0 . 0 0 on his weekly pay. What his gross...
-
What makes a technology appropriate? Is it appropriate if it does what it was designed to do? Take a knife for example: A knife can be used to slice food, but it can also be used to stab and kill a...
-
Which situation is represented by the inequality 30x+175>=25x+150 ? Stevie earns $30 for each insurance policy he sells plus $175. Daniel earns $25 for each policy she sells plus $150. How many...
-
Research an article from an online source, such as The Economist, Wall Street Journal, Journal of Economic Perspectives, American Journal of Agricultural Economics, or another academic journal. The...
-
In which phase of the SDLC does project planning typically occur? In which phase does project management occur?
-
How is a foreign key represented in relational notation?
-
Describe the steps involved in making a network diagram.
-
Record the following details relating to a carpet retailer for the month of November 2017 and extract a trial balance as at 30 November 2017. 2017 Nov 1 Started in business with 15,000 in the bank. 3...
-
You are to enter up the necessary accounts for the month of October from the following information relating to a small printing firm. Then balance-off the accounts and extract a trial balance as at...
-
What would have been the balance on the account of C. De Freitas in MC17 on 19 May 2017? (A) A debit balance of 265 (B) A credit balance of 95 (C) A credit balance of 445 (D) A credit balance of 265
Study smarter with the SolutionInn App