Question: All answers to problem set questions must be typed so they can be reviewed by Turnitin. Please submit your completed problem set through Canvas

All answers to problem set questions must be typed so they can 

All answers to problem set questions must be typed so they can be reviewed by Turnitin. Please submit your completed problem set through Canvas using the instructions found on the syllabus. Problem 1 (2 points) Distance is a key notion underlying many data mining algorithms, such as k-nearest neighbor (k- NN). Why can it be a problem to compare customers using regular Euclidean distance such as when they are described by age (in years), income (in dollars), and number of credit cards? How can this problem be fixed? It's common for income values to have a much wider range compared to age and credit card numbers, causing income to have a stronger influence on predictions. To address this issue, we can normalize the data. This involves selecting a scale factor for each input attribute and adjusting the inputs along each dimension accordingly. The goal is to ensure that the variance is balanced on every axis. By applying this normalization technique, we can mitigate the dominance of income values and improve the accuracy and fairness of predictions. Problem 2 (3 points) You currently work for Aperture Science, a small company that sells information technology (IT) products. The lone data scientist at Aperture approaches you one day and proposes to use k-NN estimation to build a model to predict the IT budget of companies to identify potential new clients. They would like your help building and deploying the model. The only data you have on hand is a sample of companies across the United States, which includes their IT budget for last year, their total revenue last year, their total number of employees last year, and their industry classification. This data will make up your database of potential neighbors. Ultimately, as a first true test of the model you want predict the IT budget for Acme Corp., a potential client for whom you do not know their IT budget (but you know their total revenue, number of employees, and industry classification). A) Given the information above, explain how you could estimate Acme's IT budget using k- NN. B) If you chose X-N, the total number of training examples, what would be the effect?

Step by Step Solution

3.45 Rating (152 Votes )

There are 3 Steps involved in it

1 Expert Approved Answer
Step: 1 Unlock

Answer Problem 1 Distance is a key notion underlying many data mining algorithms such as knearest neighbor kNN However it can be problematic to compar... View full answer

blur-text-image
Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock
Step: 3 Unlock

Students Have Also Explored These Related Economics Questions!