An Elementary Introduction To Statistical Learning Theory 1st Edition Sanjeev Kulkarni - Solutions
True or False: In SVMs, using linear rules in the transformed space can provide nonlinear rules in the original feature space.
It is sometimes suggested that in selecting a hypothesis on the basis of data, one should balance the empirical error of hypotheses against their simplicity. Does statistical learning theory provide a critique of that suggestion?
What is the instrumentalist conception of theories?
Is it reasonable for you to believe that you are not a brain in a vat being given the experiences of an external world? Explain your answer.
How should scientists measure the simplicity of hypotheses? Should a scientist favor simpler hypotheses over more complex hypotheses that fit the data equally well? Are there alternatives?
Is it reasonable to use simplicity to decide among hypotheses that could not be distinguished by any possible evidence?
Does the hypothesis that all emeralds are grue imply that emeralds will change color in 2050?
How could there be nondenumerably many hypotheses? Could a nondenumerable set of things be well ordered?
Critically assess the following argument. “The hypothesis that the world is just a dream is simpler than the hypothesis that there really are physical objects. The two hypotheses account equally well for the data. So, it is more reasonable to believe that the world is just a dream than that there really are physical objects.”
Suppose that competing hypotheses h1 and h2 fit the data equally well but h1 is simpler than h2. Does a policy of accepting h1 rather than h2 in such a case have to assume that the world is simple? Why or why not?
Consider the class F of all nondecreasing functions of a single variable x. That is, a function f (x) belongs to F if f (x2) ≥ f (x1) whenever x2 > x1. What is the pseudo-dimension of F ? Justify your answer.
(a) If we let F be the set of all piecewise constant functions (with no bound on the number of constant pieces and where the value of F may increase or decrease from one piece to the next), what is the pseudo-dimension of F? (b) Do you think this class of functions is PAC learnable? Why or why not?
Consider the class F consisting of all piecewise constant functions with two pieces. That is, f ∈ F if, for some constants α1, α2, β, f(x) is of the form f(x) = α1 for x ≤ β and f(x) = α2 for x > β. What is the pseudo-dimension of F?
What is the pseudo-dimension of all constant functions? That is, f ∈ F if f(x) is of the form f(x) = α for some constant α.
In function estimation (as opposed to classification), finiteness of which of the following is sufficient for PAC learning? VC-dimension; pseudo-dimension; Popper dimension; dimension-X; fractal dimension; feature space dimension; Hausdorff dimension.
True or False: In the function estimation problem, if x is the observed feature vector and y is a real value that you wish to estimate, the optimal Bayes decision rule (for least squared error) dictates that we estimate y according to the maximum posterior probability, that is, find the maximum of
True or False: The Bayes error rate for estimation problems (as opposed to classification problems) can be greater than 1/2.
(a) Suppose we consider the function estimation problem (rather than the classification problem), but insist on using the probability of error as our success criterion (rather than squared error). That is, we “make an error” whenever our estimate f(x) is not equal to y. What is the smallest
Discuss the quip “Even a stopped clock is right twice a day,” and contrast it with the addition that “and a working clock is almost never exactly right.”
If a loss function other than squared error is used, will the regression function (i.e., the conditional mean of y, given x) still always be the best estimator?
If one learning algorithm works with a class of rules that includes all the rules of a different algorithm and some others, does this mean that the former algorithm should give a strictly better performance in a learning task? Why or why not? Discuss any advantages/disadvantages to be had by
One might argue that every learning algorithm works with a fixed set of decision rules, namely, the set of all rules that the particular algorithm might possibly produce over all possible observations. In light of such an argument, what are the advantages of explicitly specifying a class of rules C
If the correct classification of an item is completely determined by its observable features, what is the Bayes error rate for decision rules using those features? In this case, does it follow that C is PAC learnable?
(a) In PAC learning, suppose we let the learner choose the feature points he/she would like classified instead of providing randomly drawn examples. Suggest a way in which you might measure how much “learning” has taken place in this case, that is, what sort of performance criterion might you
In PAC learning, discuss the advantages and disadvantages of requiring that the same sample size m(ε, δ) work for every choice of prior probabilities and conditional densities. How might we modify the definition of the PAC learning model to relax this requirement?
Consider the class C of all convex subsets of the plane. What is VCdim(C)? Justify your answer. (Hint: think about points on the circumference of a circle.)
Consider the class C of all orthogonal rectangles in the plane, that is, all rectangles whose sides are parallel to the coordinate axes. What is VCdim(C)? Justify your answer.
Is the class of rules representable by a perceptron PAC learnable?
What relations of the form X ≤ Y hold between R∗, R̂(h), and R∗_C?
If C is a class of decision rules, what are R∗, R̂(h), and R∗_C?
If C is a class of decision rules, what condition on VCdim(C) is needed for PAC learnability?
If VCdim(C) = v, what is the smallest number of rules that the class C could contain?
True or False: VCdim(C) is one plus the number of parameters needed to specify a particular rule from the class C.
True or False: VCdim(C) is the largest integer v such that every set of v points can be shattered by C.
True or False: A set of k points is shattered by a class of rules C if all 2^k labelings of the points can be generated using rules from C.
(a) Describe as precisely as you can what it means for C to be PAC learnable, explaining the roles of ε and δ and the requirements on the sample size. (b) Why do we settle for R∗_C instead of R∗, and why do we introduce ε and δ?
If one learning algorithm works with a class of rules that includes all the rules of a different algorithm and some others, does this mean that the former algorithm should give a strictly better performance in a learning task? Why or why not? Discuss any advantages/disadvantages to be had by
One might argue that every learning algorithm works with a fixed set of decision rules, that is to say, the set of all rules that the particular algorithm might possibly produce over all possible observations. In light of such an argument, is there really anything new to the perspective of working
Let C be a class of decision rules and let h be the hypothesis produced by a learning algorithm. Write the condition on the error rate R(h) for PAC learnability in terms of R∗_C, ε, and δ.
What is the class of rules that can be represented by a perceptron?
Is the process of reasoning toward reflective equilibrium analogous to the algorithm of gradient descent that is used to train a neural network?
Explain why the sort of learning rule discussed in this chapter is appropriately called “backpropagation.”
Why is the weighted input to a unit in a feedforward network passed through a sigmoid function rather than a simple threshold function?
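A quick way to see the point behind this question (a sketch, not the site's step-by-step answer): the Python snippet below compares a sigmoid with a hard threshold; the sigmoid's derivative is nonzero everywhere, which is what lets gradient-based training propagate an error signal through the network.

import math

def sigmoid(z):
    # smooth squashing function: differentiable everywhere
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)           # strictly positive, so gradients carry information

def hard_threshold(z):
    return 1.0 if z >= 0 else 0.0  # derivative is 0 wherever it exists: no gradient signal

print(sigmoid(0.3), sigmoid_derivative(0.3), hard_threshold(0.3))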
What is “gradient descent” and what is a potential problem with it?
True or false: The backpropagation learning method for feedforward neural networks will always find a set of weights that minimizes the error on the training data.
True or false: The backpropagation learning method requires that the units in the network have sharp thresholds.
Consider a classification problem in which each instance consists of d features x1, ..., xd, each of which can only take on the values 0 or 1. A feature vector belongs to class 0 if x1 + x2 + ··· + xd is even (i.e., the number of 1’s is even) and it belongs to class 1 otherwise. Can this problem
Explain the sense in which for any decision rule there is a three-layer network that approximates that rule. Sketch a proof of this.
What is a convex set?
Consider the XOR problem with two inputs x1 and x2. That is, the output is 1 if exactly one of x1, x2 is positive, and the output is 0 otherwise. Construct a simple three-layer network to solve this problem.
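One plausible construction (an illustration, not the official solution), assuming binary inputs and hard-threshold units: a hidden OR unit and a hidden AND unit feed an output unit that fires when OR is on but AND is off.

def step(z):
    # hard-threshold unit: fires when its weighted input is nonnegative
    return 1 if z >= 0 else 0

def xor_net(x1, x2):
    h_or  = step(x1 + x2 - 0.5)      # hidden unit 1: ON if at least one input is 1
    h_and = step(x1 + x2 - 1.5)      # hidden unit 2: ON only if both inputs are 1
    return step(h_or - h_and - 0.5)  # output: ON iff exactly one input is 1

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))   # 0 0 -> 0, 0 1 -> 1, 1 0 -> 1, 1 1 -> 0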
How is a perceptron trained? What features of the perceptron typically change during training? Formulate a learning rule for making such changes and explain how it works. (a) What is a single perceptron? (b) What is a multilayer feedforward neural network?
Consider the following network. There are four inputs with real values. Each input is connected to each of two perceptrons on the first layer that do not take thresholds but simply output the sum of the weighted products of their inputs. Each of these perceptrons is connected to an output threshold
As in the previous problem, consider a classification problem in which each instance consists of d features x1, ..., xd, each of which can take on only the values 0 or 1. A feature vector belongs to class 0 if x1 + x2 + ··· + xd is even (i.e., the number of 1’s is even) and it belongs to class 1 otherwise.
Consider a classification problem in which each instance consists of d features x1, ..., xd, each of which can take on only the values 0 or 1. Come up with a linear threshold unit (a single perceptron) that functions as an AND gate (i.e., the output is 1 if all of the inputs xi are 1, and the output is 0 otherwise).
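A minimal sketch of one way such a unit could look (an illustration, not the graded answer), assuming all weights equal to 1 and a threshold of d − 0.5:

def and_gate(x):
    # weights all 1, threshold len(x) - 0.5: fires only when every binary input is 1
    d = len(x)
    return 1 if sum(x) > d - 0.5 else 0

print(and_gate([1, 1, 1]))  # 1
print(and_gate([1, 0, 1]))  # 0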
Design a linear threshold unit with two inputs that outputs the value 1 if and only if the first input has a greater value than the second. (What are the weights on the inputs and what is the threshold?)
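One candidate design (a sketch, under the assumption that "greater" means strictly greater): weights (1, −1) and threshold 0, so the unit fires exactly when x1 − x2 > 0.

def greater_than_unit(x1, x2):
    # weighted sum 1*x1 + (-1)*x2 compared against threshold 0
    return 1 if (1.0 * x1) + (-1.0 * x2) > 0.0 else 0

print(greater_than_unit(2.0, 1.5))  # 1: the first input is larger
print(greater_than_unit(1.0, 3.0))  # 0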
Consider a linear threshold unit (perceptron) with three inputs and one output. The weights on the inputs are respectively 1, 2, and 3, and the threshold is 0.5. If the inputs are respectively 0.1, 0.2, and 0.3, what is the output? If all three inputs are 0.1, what is the output?
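A quick numerical check (assuming the unit outputs 1 when the weighted sum exceeds the threshold; with these numbers the ≥ versus > convention does not change the result):

w = [1.0, 2.0, 3.0]
threshold = 0.5

def unit_output(x):
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s > threshold else 0

print(unit_output([0.1, 0.2, 0.3]))  # weighted sum 1.4 > 0.5, output 1
print(unit_output([0.1, 0.1, 0.1]))  # weighted sum 0.6 > 0.5, output 1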
(a) Sketch a diagram of a perceptron with three inputs x1, x2, x3 and weights w1, w2, w3. Label the inputs, weights, and output. (b) Write the expression for the output in terms of the inputs and the weights, assuming a threshold of 0. (c) What is the output when the inputs are −3, 2, 1 and the
Repeat parts (a), (b), and (c) of the previous problem for the triangular kernel. For part (d), find the value h for which the classification rule first decides 0 for all x.
Consider the special case where we have a 1-dimensional feature vector and are interested in using a kernel rule. Suppose we have the training data (0, 0), (1, 1), and (3, 0), and we use the simple moving window classifier (i.e., with kernel function K(x) = 1 for |x| ≤ 1 and K(x) = 0 otherwise). (a)
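Since part (a) is cut off here, the sketch below only sets up the moving window rule on the listed training data, assuming smoothing parameter h = 1 and ties broken in favor of class 0; it illustrates the rule rather than giving the worked answer.

data = [(0.0, 0), (1.0, 1), (3.0, 0)]   # (feature, label) pairs from the problem statement

def K(u):
    # moving window kernel: 1 inside [-1, 1], 0 outside
    return 1.0 if abs(u) <= 1.0 else 0.0

def kernel_classify(x, h=1.0):
    v0 = sum(K((x - xi) / h) for xi, yi in data if yi == 0)  # votes for class 0
    v1 = sum(K((x - xi) / h) for xi, yi in data if yi == 1)  # votes for class 1
    return 1 if v1 > v0 else 0

print(kernel_classify(0.5))  # x_i = 0 and x_i = 1 both fall in the window: v0 = v1 = 1, decide 0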
True or False: There might be some set of labeled data (training examples) such that the 1-nearest neighbor method and the kernel method (for some K(x) of your choice) can give exactly the same decision for any observed feature vector.
True or False: The decision rules arising from using a kernel method with two different kernel functions, K1(x) and K2(x) = 2K1(x), are exactly the same.
What conditions are required on the smoothing parameter hn for a kernel rule to be universally consistent?
For the kernel function of the previous problem, smoothing factor h = 0.5 and xi = 3, sketch K((x − xi)/h).
For a one-dimensional feature x, write the equation for the triangular kernel function, K(x).
For a one-dimensional feature x, sketch the simple kernel function K(x) = I{|x| ≤ 1}.
Write the expression for the vote count for class 0, v_n^0(x), in terms of the data (x1, y1), . . . , (xn, yn), indicator functions, and the kernel K(x).
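Assuming the book's usual notation, with labels y_i in {0, 1} and smoothing parameter h, one standard way to write this vote count is v_n^0(x) = Σ_{i=1}^n I{y_i = 0} · K((x − x_i)/h); the moving window sketch above computes exactly this quantity for a concrete data set.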
For the simplest kernel classification rule, what choices of the smoothing parameter h are analogous to selecting kn = 1 and kn = n, respectively, in the nearest neighbor rule? What happens to the error in these extreme cases?
Briefly discuss the following position. Under appropriate conditions, the kn-NN rule is universally consistent, so the choice of features does not matter.
If we use a kn-NN rule with kn = n, what would be the resulting error rate in terms of P (0), P (1), P (x|0), and P (x|1)?
What conditions are required on kn for the kn-NN rule to be universally consistent?
Describe as precisely as you can the tradeoffs of having a small kn versus a large kn in the kn-nearest neighbor classifier. What happens in the extreme cases when kn = 1 and when kn = n?
Come up with a case (i.e., give the prior probabilities and conditional densities) in which the error rate of the NN rule equals the Bayes error rate, and briefly explain why this happens in the case you give.
Recall that for the 1-NN rule, the region associated with a feature vector xi is the set of all points that are closer to xi than to any of the other feature vectors xj for j ≠ i. These are the Voronoi regions. Sketch the Voronoi regions for feature vectors x1 = (0, 0), x2 = (0, 2), x3 = (2, 0),
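The list of feature vectors is cut off here, so the sketch below only illustrates the idea on the three points that survive: the Voronoi cell containing a query point is determined by its nearest neighbor.

import math

points = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0)]   # x1, x2, x3 as listed (the rest is truncated)

def voronoi_cell(q):
    # index of the nearest feature vector = index of the Voronoi region containing q
    dists = [math.dist(q, p) for p in points]
    return dists.index(min(dists))

print(voronoi_cell((0.4, 0.3)))  # 0: closest to x1 = (0, 0)
print(voronoi_cell((0.4, 1.8)))  # 1: closest to x2 = (0, 2)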
(a) What is the NN rule and how does the expected error from the use of this rule compare with the Bayes error rate? (b) What conditions on kn are required for the kn-NN rule to have an asymptotic error rate equal to the Bayes error rate?
Will the remaining training data be independent? Identically distributed according to the underlying distributions? (d) If we use method M but on the training data as modified in part (c) by throwing away all examples with label 1, what asymptotic error rate will we get? (e) Give a very brief
Consider a pattern recognition problem with prior probabilities P (0) = 0.2 and P (1) = 0.8 and conditional distributions P (x|0) and P (x|1). Let R∗ denote the Bayes error rate for this problem. As usual, suppose we have independent and identically distributed training data (x1, y1), . . . , (xn,
What is inductive bias? Is it good or bad?
What is the curse of dimensionality?
When might a brute force approach to the learning problem be useful?
What does it mean for examples to be drawn iid?
What sorts of assumptions do we need to make about the training data?
What are training data for learning from examples?
Why do we need learning? Why can't we simply use a Bayes rule for pattern classification?
What is the general learning problem we are concerned with?
True or false? P (A|B) might have a definite value even if P (B) = 0.
Suppose the feature vector x can only take values −1 or +1, and the Bayes rule is as follows: decide 0 when x = −1 and decide 1 when x = +1. Suppose the cost of a correct decision is 0, but the cost of an incorrect decision is as follows. The cost of deciding 0 when the actual class is 1 is $5 and
Suppose that 5% of women and 0.25% of men in a given population are colorblind. (a) Write Bayes Theorem. (b) If a colorblind person is chosen at random from a population containing an equal number of males and females, what is the probability that the chosen colorblind person is male?
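A quick numeric check of part (b) via Bayes theorem (a sketch, using the percentages given above and the stated equal split of males and females):

p_cb_given_female = 0.05     # 5% of women are colorblind
p_cb_given_male   = 0.0025   # 0.25% of men are colorblind
p_male = p_female = 0.5      # equal numbers of males and females

# Bayes theorem: P(male | colorblind) = P(cb | male) P(male) / P(colorblind)
p_cb = p_cb_given_male * p_male + p_cb_given_female * p_female
print(p_cb_given_male * p_male / p_cb)   # 1/21, about 0.048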
A die is weighted so that when it is tossed, one particular number will come up with a probability of 1/2 and each of the five other numbers will come up with a probability of 1/10. Suppose that each of the six sides has had an equal chance of being the side that is favored. We are interested in
Suppose we know that the probability that it will rain on any given day is p and is independent of the weather on all other days. As a meteorologist, what should your pattern of weather predictions be from day to day to minimize your probability of error, and what is the resulting error rate?
Suppose there are two curtains and behind each curtain is either gold or a goat. Suppose the prior probability that there is gold behind a given curtain is 1/2, is the same for each of the curtains, and is independent of what is behind the other curtain. (a) If you are told that there is gold behind
You experiment by withdrawing a ball, checking its color, and returning it. Given that the ball is white, what is the probability that the container before you is container 1? If you repeat the experiment three times, what is the probability of getting the result white-black-white?
There are two containers of black and white balls. In container 1, 1/3 of the balls are black and 2/3 are white. In container 2, 2/3 of the balls are black and 1/3 are white. The container before you is equally likely to be container 1 or container 2.
Continuing with the cost assumptions of the previous problem. (a) In general, when you observe a feature x, what is the average cost if you decide 0? What is it if you decide 1? (These should be dependent on x.) (b) On the basis of these average costs, to minimize the average cost, when should you
Consider the situation in question 4 and assume now that it costs $12 if you decide 0 when it really was a 1 and $10 if you decide 1 when it really was a 0. (a) If you use the Bayes decision rule that you derived for question 4, how much does it cost you to make a decision on average? (b) Imagine
For the situation in question 4, consider two decision makers, Ivan and Derek, who make a decision solely on the basis of prior probabilities (i.e., without observing any features). Ivan thinks that the decision should be randomized according to the prior probability distribution. So, he decides
Consider two classes 0 and 1 with P (0) = 0.6. Suppose that the conditional probability density functions of the feature x are p(x|0) = 1 for 0 ≤ x ≤ 1 and p(x|1) = 2x for 0 ≤ x ≤ 1. (a) What is the feature space? (b) What are the possible decision rules? (c) What is the optimal (Bayes) decision rule?
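For part (c), a minimal numeric sketch (not the full worked solution) of the usual Bayes comparison of P(0)p(x|0) against P(1)p(x|1), using the densities given above; the two sides balance where 0.4 · 2x = 0.6, i.e., at x = 0.75.

P0, P1 = 0.6, 0.4

def bayes_decide(x):
    score0 = P0 * 1.0        # P(0) * p(x|0), with p(x|0) = 1 on [0, 1]
    score1 = P1 * 2.0 * x    # P(1) * p(x|1), with p(x|1) = 2x on [0, 1]
    return 0 if score0 >= score1 else 1

print(bayes_decide(0.5))  # 0: 0.6 >= 0.4
print(bayes_decide(0.9))  # 1: 0.72 > 0.6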
True or false? If the features, classifications, and probabilities are fixed, there can be more than one Bayes rule.
What is a Bayes rule?
What is Bayes theorem? Why is it true? How might Bayes theorem be useful in determining the probability of a hypothesis, given some evidence?
Are there uncountably many rational numbers?