2. (35 points) Effect of constraints on optimal solutions. A key result that optimal classifiers pick...

Question:

2. (35 points) Effect of constraints on optimal solutions. A key result is that optimal classifiers pick the most probable class, which defines Bayes optimality. One consequence is that optimal decisions are pure, or crisp: they don't weight or mix decisions. But is this result always true? Here we show how constraints can make the answer to this question NO by understanding how they change the nature of an optimal solution. One of the key ideas in the course is that the knowledge we have about the structure of our data also constitutes data that we can encode as constraints. A simple example is when a variable has upper and/or lower bounds. As a concrete example, we might want to choose the best product (e.g. pizza) for a discrete set of use cases (e.g. food restriction types like vegetarian = 1, gluten-free = 2, dairy-free = 3, etc.) for the lowest price, given a dataset with features $x = [x_{\text{item}}, x_{\text{cust}}, x_{\text{price}}]$ labeled by the best use case $y \in \{0, 1, 2, \dots\}$.

Consider an $N$-class classification problem with features $x \in \mathbb{R}^d$ and one-hot encoded labels $y = e_k$, where the $e_k$ are unit vectors with 1 at component $k$ and zero elsewhere, and $k \in \{1, \dots, N\}$. Assume the data is $D$-distributed, $(x, y) \sim D$, where $D$ is a fixed (but unknown) distribution on $\mathbb{R}^d \times \{0, 1, 2, \dots\}$. Assume $p(y = e_k) = \alpha_k$ such that $\sum_k \alpha_k = 1$. Consider the classifier given by

$f(x) = \{k : P(y = e_k \mid x) \ge \beta_{jk} P(y = e_j \mid x) \;\; \forall (j \ne k)\}$

The standard classification loss is the error rate, given by

$L(f) = P(f(x) \ne y) = E_{(x,y) \sim D}[\mathbb{1}(f(x) \ne y)]$,

where $\mathbb{1}$ is the indicator function (for 0-1 loss), and $L(\cdot)$ is the expected 0-1 loss or true error rate. Professor Bayes claims the following: for any other classifier function $g : \mathbb{R}^d \to \{0, 1, 2, \dots\}$, we have $L(f) \le L(g)$, which is the definition of Bayes optimal for the proper choice of $\beta_{jk}$. The result is standard and easy to find.

(a) (5 points) Using the notation above and your own words, explain why Professor Bayes's claim is true and determine what $\beta_{jk}$ must be.

(b) (10 points) For part b, consider the following scenario. You have been given pizza data as described above and trained the optimal classifier. However, you find that the classifier performs very poorly on test data. Because you are good with data, you go back to the raw data and find that it includes two pieces of information about the customers that were left out: the customer's daily income and expenses, which you encode into a new feature vector $z_{\text{cust}}$. These features were left out because they were found to be independent of the customer's food restriction type, and because the pizza option features (value, restriction type) are independent of the customer's income and expenses. In addition, you discover that the labeled data was generated by actual customers who used a drop-down menu to select their food restrictions and then selected the pizza option that was "the best value" from a list of options which satisfied the indicated restriction. Value information is given in terms of cost for a standardized slice. If you include the daily income and expense values in the feature vector, you find that the resulting classifier is significantly better, with near perfect performance. Explain what could have gone wrong with the original classifier to produce these results and how we could modify the data collection task and data labeling to better reflect the problem domain. In your explanation, convert the information described in the scenario into conditional probabilities over the variables $x_{\text{cust}}$, $x_{\text{price}}$, $z_{\text{cust}}$, $S_{\text{cust}}$, $y_{\text{cust,option}}$, $S_{\text{option}}$, where $S_{\text{cust}}$ are the sets of restriction types needed for each customer, $S_{\text{option}}$ are the sets of restriction types satisfied by each pizza option, and $y_{\text{cust,option}}$ is the restriction type of the option each customer selected as the best value for them.

(c) (10 points) There is an apparent paradox between the result in part a and the scenario in part b (although hopefully your analysis has uncovered a resolution). To understand the paradox, assume that the classifier $f$ is Bayes optimal and $g$ is different from $f$. If the true labels in these regions are $k$, then we should always pick classifier $f$, which is justified on the basis of the logic of best guessing: observing a label at a point $x$ in the feature space is like observing a biased random variable (like a coin, die, etc.) that comes up with value $k$ with higher probability than the other options. The best guess (in terms of minimum error) is to always select $k$, which succeeds with probability $p_k$. Otherwise, each time we select a different option $j \ne k$, our guesses succeed with a lower probability $p_j$. Show that choosing $j$ for a fraction $\gamma$ of the total choices results in a success rate of $\gamma p_j + (1 - \gamma) p_k$, and use the fact that this is a convex combination to prove that the maximal success rate is $p_k$. The general case is that there must be some region of the feature space where $g$ is assigning different class values. Using the notation $R_f(k) = \{x : k = f(x)\}$ to represent the set inverse of the classifier $f$ for output $k$, there are differences $\Delta_{jk} R = R_f(k) \cap R_g(j)$, and some of the $\Delta_{jk} R \ne \emptyset$. Illustrate the idea with an appropriate drawing.

(d) (10 points) One of the key constraints in the problem could be that customers have budgets. Let's simplify the problem to see what goes wrong. Assume you have two pizza options in different classes, both of which satisfy the customer's food restrictions. One option has better value, but the options come in a discrete set of unit sizes (say two) and have a discrete set of costs. If the customer has a budget constraint in terms of not exceeding a total cost threshold, show it is possible for the option with the worse value to be the better choice (assuming eating pizza is better than starving) if the unit sizes are smaller for the higher cost pizza.
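As a rough numerical illustration of the mixing argument in part (c): if option k succeeds with probability p_k and option j with a lower probability p_j, then guessing j a fraction gamma of the time gives a success rate gamma * p_j + (1 - gamma) * p_k = p_k - gamma * (p_k - p_j), which is largest at gamma = 0. The sketch below only evaluates that expression on a grid; the function name `mixed_success_rate` and the probabilities 0.6 and 0.3 are hypothetical values chosen for illustration, not part of the problem statement.

```python
def mixed_success_rate(p_k: float, p_j: float, gamma: float) -> float:
    """Success rate when guessing j a fraction gamma of the time, and k otherwise."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    # Convex combination of the two per-guess success probabilities.
    return gamma * p_j + (1.0 - gamma) * p_k


# Hypothetical probabilities with p_k > p_j (assumed for illustration only).
p_k, p_j = 0.6, 0.3

# Evaluate the success rate for gamma = 0.0, 0.1, ..., 1.0.
gammas = [g / 10 for g in range(11)]
rates = [mixed_success_rate(p_k, p_j, g) for g in gammas]

# The best mixing fraction on this grid.
best_gamma = gammas[rates.index(max(rates))]

for g, r in zip(gammas, rates):
    print(f"gamma={g:.1f}  success rate={r:.2f}")
print("best gamma:", best_gamma)
```

Because the expression is linear in gamma with a negative slope whenever p_k > p_j, any grid would show the same pattern: every amount of mixing lowers the success rate relative to always choosing k.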
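The budget effect described in part (d) can be sketched with made-up numbers. Everything here is an assumption for illustration: the class `PizzaOption`, the helper names, the prices, the unit sizes, and the budget of 10 are hypothetical. The "better value" option costs less per standardized slice but is only sold in large units, while the "worse value" option costs more per slice but comes in small units.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PizzaOption:
    name: str
    slices_per_unit: int   # unit size, in standardized slices
    cost_per_slice: int    # value: cost of one standardized slice, in dollars

    @property
    def unit_cost(self) -> int:
        # Pizza can only be bought in whole units.
        return self.slices_per_unit * self.cost_per_slice


def slices_within_budget(option: PizzaOption, budget: int) -> int:
    """Number of slices obtainable without exceeding the budget."""
    units = budget // option.unit_cost
    return units * option.slices_per_unit


def best_option(options: list[PizzaOption], budget: int) -> PizzaOption:
    """Pick the option that yields the most slices under the budget."""
    return max(options, key=lambda o: slices_within_budget(o, budget))


# Hypothetical options: better value but large units, worse value but small units.
better_value = PizzaOption("better value", slices_per_unit=8, cost_per_slice=2)
worse_value = PizzaOption("worse value", slices_per_unit=2, cost_per_slice=3)
options = [better_value, worse_value]

print(best_option(options, budget=10).name)  # tight budget
print(best_option(options, budget=16).name)  # budget covers one large unit
```

With these assumed numbers, a budget of 10 cannot cover a single unit of the better-value option (unit cost 16), so the worse-value option (unit cost 6) is the only way to get any pizza. Once the budget reaches 16 the ranking flips back.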

Posted Date: Feb 07, 2024 11:43 AM
