Q2. Perceptrons Instead of using Naive Bayes, you decide to try applying Perceptron to the interrogation...

Fantastic news! We've Found the answer you've been seeking!

Question:

Transcribed Image Text:

Q2. Perceptrons Instead of using Naive Bayes, you decide to try applying Perceptron to the interrogation data. You generate features from the training data as follows: • 91(x) = n where "not" appears n times. • v2(x) = m where "swear" appears m times. • P3(x) = 1, a bias term We use the labels +1 for G and -1 for 1. Given a weight vector w = (wi, W2, w3), our classifier returns +1 if Wi91 (x) + W242(x) + w3 >= 0 and -1 otherwise. Our training set from part 1 yields the following features and labels: Training Statements I am definitely innocent, officer Officer, I swear I am not lying I am not lying, I swear I am innocent, officer, I swear Officer, I am definitely not lying 1 Label P3 +1 1 1 +1 1 +1 1 1 -1 -1 (a) [2 pts] Compute the first two updates of the Perceptron algorithm and fill in the following table, using the given initial Perceptron weights w = (w1, w2, w3) and data points (P1, 42 P3. Label). W1 W2 W3 Initial 2 -0.5 Observing (0, 0, 1, +1) Observing (1, 0, 1, -1) (b) [2 pts] What convergence guarantees can you give for the Perceptron algorithm applied to this data set? (c) [2 pts] Linear classifiers are often insufficient to represent a dataset using a given set of features. However, it is often possible to find new features using nonlinear functions of our existing features which do allow linear classifiers to separate the data. Nonlinear features result in more expressive linear classifiers. For example, consider the following data set, where +'s represent positive examples and's represent negative examples. -1 +1 No linear classifier can separate the positive examples (-1, 0) and (1, 0) from the negative example (0,0). Rather than using a single feature, if we perform a nonlinear mapping o(x1, x2) = (x², 1}, the positive examples are both mapped to (1, 1) and the negative example is mapped to (0, 1), and we see the data can be separatedby a linear classifier. One example is the line w [1, -0.5], i.e. the classifier w'o(x) = x2 - 0.5 >= 0. %3D + +! +1 For what values of the weight vector w (wi, w2) does the classifier w'o(x) >= 0 separate the given data? d) [3 pts] Which of the following feature sets allows a linear classifier w = (w1, w2 w3) to separate the original interrogation data set? Justify your answer briefly. %3! () [1 pt] p'= (φ1 + P2 P1 - P2 1) %3D (ii) [1 pt] o' = (p192 oz 1) %3D (ii) [1 pt] o' = ((ı xor o2), q2 1) where a xor b is 1 if either a = 1 or b = 1 but not both. (e) [2 pts] Given the features o(x) = [x?, x, 1], how many data points are we guaranteed to be able to separate with zero error using a linear classifier w'o(x) = wix? + w2x + w3? Assume that a data pointx cannot have conflicting labels. Justify your answer briefly. () [2 pts] In general, if we use features o(x) = [xN-1, xN-2 ., 1], ie. an N - 1th order polynomial, how many points can we separate with zero error using a linear classifier w = [w,., WN ]? Justify your answer briefly. (g) [2 pts] Assume we have N labeled training data points, which we would like to use for classification i.e. to predict the labels of unseen test data points. What are the disadvantages of using an Nth order polynomial to fit this data? Q2. Perceptrons Instead of using Naive Bayes, you decide to try applying Perceptron to the interrogation data. You generate features from the training data as follows: • 91(x) = n where "not" appears n times. • v2(x) = m where "swear" appears m times. • P3(x) = 1, a bias term We use the labels +1 for G and -1 for 1. Given a weight vector w = (wi, W2, w3), our classifier returns +1 if Wi91 (x) + W242(x) + w3 >= 0 and -1 otherwise. Our training set from part 1 yields the following features and labels: Training Statements I am definitely innocent, officer Officer, I swear I am not lying I am not lying, I swear I am innocent, officer, I swear Officer, I am definitely not lying 1 Label P3 +1 1 1 +1 1 +1 1 1 -1 -1 (a) [2 pts] Compute the first two updates of the Perceptron algorithm and fill in the following table, using the given initial Perceptron weights w = (w1, w2, w3) and data points (P1, 42 P3. Label). W1 W2 W3 Initial 2 -0.5 Observing (0, 0, 1, +1) Observing (1, 0, 1, -1) (b) [2 pts] What convergence guarantees can you give for the Perceptron algorithm applied to this data set? (c) [2 pts] Linear classifiers are often insufficient to represent a dataset using a given set of features. However, it is often possible to find new features using nonlinear functions of our existing features which do allow linear classifiers to separate the data. Nonlinear features result in more expressive linear classifiers. For example, consider the following data set, where +'s represent positive examples and's represent negative examples. -1 +1 No linear classifier can separate the positive examples (-1, 0) and (1, 0) from the negative example (0,0). Rather than using a single feature, if we perform a nonlinear mapping o(x1, x2) = (x², 1}, the positive examples are both mapped to (1, 1) and the negative example is mapped to (0, 1), and we see the data can be separatedby a linear classifier. One example is the line w [1, -0.5], i.e. the classifier w'o(x) = x2 - 0.5 >= 0. %3D + +! +1 For what values of the weight vector w (wi, w2) does the classifier w'o(x) >= 0 separate the given data? d) [3 pts] Which of the following feature sets allows a linear classifier w = (w1, w2 w3) to separate the original interrogation data set? Justify your answer briefly. %3! () [1 pt] p'= (φ1 + P2 P1 - P2 1) %3D (ii) [1 pt] o' = (p192 oz 1) %3D (ii) [1 pt] o' = ((ı xor o2), q2 1) where a xor b is 1 if either a = 1 or b = 1 but not both. (e) [2 pts] Given the features o(x) = [x?, x, 1], how many data points are we guaranteed to be able to separate with zero error using a linear classifier w'o(x) = wix? + w2x + w3? Assume that a data pointx cannot have conflicting labels. Justify your answer briefly. () [2 pts] In general, if we use features o(x) = [xN-1, xN-2 ., 1], ie. an N - 1th order polynomial, how many points can we separate with zero error using a linear classifier w = [w,., WN ]? Justify your answer briefly. (g) [2 pts] Assume we have N labeled training data points, which we would like to use for classification i.e. to predict the labels of unseen test data points. What are the disadvantages of using an Nth order polynomial to fit this data?

Related Book For answer-question

answer-question

International Business Law and Its Environment

International Business Law and Its Environment

ISBN: 978-0324649659

7th Edition

Authors: Richard schaffer, Filiberto agusti, Beverley earle

See More Books

Posted Date: May 19, 2022 07:46 AM

See More Questions