Artificial Intelligence: A Textbook, 1st Edition, Charu C. Aggarwal - Solutions
Suppose that your linear regression model shows similar accuracy on the training and test data. How should you modify the regularization parameter?
Suppose that the split at the top level of the decision tree is chosen using a domain-specific condition by a human expert. The splits at other levels are chosen in a data-driven manner. How do the bias and variance of this decision tree compare to those of a purely inductive decision tree?
Suppose that you modify an inductive rule-based classifier into a two-stage classifier. In the first stage, domain-specific rules are used to decide if the test instance matches these conditions. If it does, the classification is performed with the domain-specific rules. Otherwise, the second stage
What effect does the use of Laplacian smoothing in the Bayes classifier have on the bias and variance?
Suppose that a model provides extremely poor (but similar) accuracies on both the training data and the test data. What are the most likely sources of the error (among bias, variance, and noise)?
Does the bias of a decision tree increase or decrease by reducing the height of the tree via pruning? How about the variance?
Does the bias of a κ-nearest neighbor classifier increase or decrease with increasing value of κ? What happens to the variance? What does the classifier do when one sets the value of κ to the data size n?
Show how you can perform the steps of Exercise 5 with the use of stochastic gradient descent rather than gradient descent.
Compute the gradient-descent steps of the optimization model introduced in Section 12.4.5. Show that the gradient-descent steps are as follows:
M ⇐ M + αEV
U_i ⇐ U_i + αβ_i(D_i − U_iVᵀ)V
V ⇐ V + αEᵀM + α Σ_{i=1}^{m} β_i(D_i − U_iVᵀ)ᵀU_i
Here, α is the learning rate, and E is an error matrix E =
Include a concept hierarchy for movie objects based on the genres of the movies.
Consider the IMDB movie database available at the URL https://www.imdb.com/interfaces/. Implement a program to create a heterogeneous information network discussed in Exercise
Consider a repository of movies appearing in different countries. For each movie, you have a hierarchical classification corresponding to the genre. You want to create a heterogeneous network containing four types of objects corresponding to movies, country of origin, actors, and directors. Propose
You may omit the step involving creation of the concept hierarchy.
Consider the DBLP publication database available at the URL https://dblp.uni-trier.de/xml/. Implement a program to create a heterogeneous information network discussed in Exercise
Consider a repository of scientific articles containing articles published in various types of venues. You want to create a heterogeneous network containing three types of objects corresponding to articles, venues, and authors. Propose the various relationship types that you can construct from this
Propose an approach for using RBMs for outlier detection.
Implement the contrastive divergence algorithm of a restricted Boltzmann machine. Also implement the inference algorithm for deriving the probability distribution of the hidden units for a given test example. Use Python or any other programming language of your choice.
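Since this exercise explicitly asks for an implementation, a minimal NumPy sketch of CD-1 for a binary RBM is given below. The variable names (W, b_vis, b_hid), the single Gibbs step, and the learning rate are illustrative assumptions, not a reference solution.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, b_vis, b_hid, v0, lr=0.1):
    # Positive phase: hidden probabilities conditioned on the data vector v0.
    ph0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one step of Gibbs sampling (CD-1).
    pv1 = sigmoid(h0 @ W.T + b_vis)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b_hid)
    # Approximate gradient of the log-likelihood.
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b_vis += lr * (v0 - v1)
    b_hid += lr * (ph0 - ph1)

def hidden_distribution(W, b_hid, v):
    # Inference for a test example: P(h_j = 1 | v) for each hidden unit.
    return sigmoid(v @ W + b_hid)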
This chapter discusses how Boltzmann machines can be used for collaborative filtering. Even though discrete sampling of the contrastive divergence algorithm is used for learning the model, the final phase of inference is done using real-valued sigmoid and softmax activations. Discuss how you can use
The two-step TD-error is defined as follows:
δ_t^(2) = r_t + γ r_{t+1} + γ² V(s_{t+2}) − V(s_t)
(a) Propose a TD-learning algorithm for the 2-step case. (b) Propose an on-policy n-step learning algorithm like SARSA. Show that the update is a truncated variant of Equation 10.18 after setting λ =
What happens for the case when n = ∞? (c) Propose an off-policy n-step learning algorithm like Q-learning and discuss its advantages/disadvantages with respect to (b).
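For part (a), a hedged sketch of the resulting tabular algorithm: the update simply replaces the one-step TD target with the two-step target defined above. The episode interface and hyperparameters are illustrative.

def two_step_td(V, episode, alpha=0.1, gamma=0.9):
    # episode: list of (state, reward) pairs, where the reward follows the state.
    # V[s_t] <- V[s_t] + alpha * (r_t + gamma*r_{t+1} + gamma^2*V(s_{t+2}) - V(s_t))
    for t in range(len(episode) - 2):
        s_t, r_t = episode[t]
        _, r_t1 = episode[t + 1]
        s_t2, _ = episode[t + 2]
        delta = r_t + gamma * r_t1 + gamma ** 2 * V[s_t2] - V[s_t]
        V[s_t] += alpha * delta
    return V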
Write a Q-learning implementation that learns the value of each state-action pair for a game of tic-tac-toe by repeatedly playing against human opponents. No function approximators are used and therefore the entire table of state-action pairs is learned using Equation 10.4. Assume that you can
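As a hedged starting point for this implementation, the core of the tabular method is the update below, together with an epsilon-greedy move selector. Representing states as hashable board encodings and the exploration scheme are assumptions, not the book's specification.

import random
from collections import defaultdict

Q = defaultdict(float)  # (state, action) -> estimated value

def q_update(s, a, r, s_next, next_actions, alpha=0.1, gamma=1.0):
    # Standard tabular Q-learning update (cf. Equation 10.4 in the text):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def epsilon_greedy(s, actions, eps=0.1):
    if random.random() < eps:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(s, a)])      # exploit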
Consider the game of tic-tac-toe in which a reward drawn from {−1, 0, +1} is given at the end of the game. Suppose you learn the values of all states (assuming optimal play from both sides). Discuss why states in non-terminal positions will have non-zero values. What does this tell you about
Consider the well-known game of rock-paper-scissors. Human players often try to use the history of previous moves to guess the next move. Would you use a Q-learning or a policy-based method to learn to play this game? Why? Now consider a situation in which a human player samples one of the three
You have two slot machines, each of which has an array of 100 lights. The probability distribution of the reward from playing each machine is an unknown (and possibly machine-specific) function of the pattern of lights that are currently lit up. Playing a slot machine changes its light pattern in
Throughout this chapter, a neural network, referred to as the policy network, has been used in order to implement the policy gradient. Discuss the importance of the choice of network architecture in different settings.
The chapter gives a proof of the likelihood ratio trick (cf. Equation 10.23) for the case in which the action a is discrete. Generalize this result to continuous-valued actions.
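As a sketch of where the generalization leads: for a continuous action density π(a|s, Θ), the sum over actions becomes an integral, and the same log-derivative identity applies (assuming the density is differentiable in Θ and differentiation can be exchanged with integration):
∇_Θ E_a[Q(s, a)] = ∇_Θ ∫ π(a|s, Θ) Q(s, a) da
                = ∫ π(a|s, Θ) [∇_Θ log π(a|s, Θ)] Q(s, a) da
                = E_a[Q(s, a) ∇_Θ log π(a|s, Θ)]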
The text of the chapter shows how one can transform any linear classifier into recognizing nonlinear decision boundaries by using a feature engineering phase in which the eigenvectors of an appropriately chosen similarity matrix are used to create new features. Discuss the impact of this type of
Suppose that you represent your data set as a graph in which each data point is a node, and the weight of the edge between a pair of nodes is equal to the Gaussian kernel similarity between them. Edges with weight less than a particular threshold are dropped. Interpret the single-linkage clustering
What is the maximum number of possible clusterings of a data set of n points into k groups? What does this imply about the convergence behavior of algorithms whose objective function is guaranteed not to worsen from one iteration to the next?
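For reference, the count being asked about is the Stirling number of the second kind (assuming all k groups must be non-empty):
S(n, k) = (1/k!) Σ_{j=0}^{k} (−1)^j C(k, j) (k − j)^n ≤ k^n / k!
Because this count is finite, an objective function that never worsens from one iteration to the next can improve only finitely many times and must therefore converge.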
Discuss why the following integer matrix factorization is equivalent to the objective function of the k-means algorithm for an n × d matrix D, in which the rows contain the data points:
Minimize_{U,V} ‖D − UVᵀ‖²_F
subject to:
Columns of U are mutually orthogonal
u_ij ∈ {0, 1}
The text of the book discusses gradient descent updates (cf. Equation 9.6) for unconstrained matrix factorization D ≈ UVᵀ. Suppose that the matrix D is symmetric, and we want to perform the symmetric matrix factorization D ≈ UUᵀ. Formulate the objective function and gradient descent steps of
Discuss the similarity of this model to that of the addition of bias to classification models. How is gradient descent modified?
Biased matrix factorization: Consider the factorization of an incomplete n × d matrix D into an n × k matrix U and a d × k matrix V:
D ≈ UVᵀ
Suppose you add the constraint that all entries of the penultimate column of U and the final column of V are fixed to
Recommender systems: Let D be an n × d matrix in which only a small subset of the entries are specified. This is commonly the case with recommender systems. Show how you can adapt the algorithm for unconstrained matrix factorization to this case, so that only observed entries are used to create
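A minimal NumPy sketch of this adaptation: the squared error is summed only over the set of observed (i, j) entries, and each SGD step touches only the corresponding rows of U and V. The rank k, learning rate, and initialization are illustrative.

import numpy as np

def sgd_matrix_factorization(observed, n, d, k=5, lr=0.01, epochs=100, seed=0):
    # observed: list of (i, j, value) triples for the specified entries of D.
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n, k))
    V = 0.1 * rng.standard_normal((d, k))
    for _ in range(epochs):
        for i, j, x in observed:
            err = x - U[i] @ V[j]   # residual on a single observed entry
            U[i], V[j] = U[i] + lr * err * V[j], V[j] + lr * err * U[i]
    return U, V                     # unobserved entries predicted by U @ V.T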
Suppose that you are given a truncated SVD D ≈ QΣPᵀ of rank k. Show how you can use this solution to derive an alternative rank-k decomposition QΣPᵀ in which the unit columns of Q (and/or P) might not be mutually orthogonal and the truncation error is the same.
Let D be an n × d data matrix, and y be an n-dimensional column vector containing the dependent variables of linear regression. The regularized solution to linear regression predicts the dependent variables of a test instance Z using the following equation:
Prediction(Z) = Z W = Z (DᵀD + λI)⁻¹ Dᵀ y
Use singular value decomposition to show the push-through identity for any n × d matrix D:
(λI_d + DᵀD)⁻¹ Dᵀ = Dᵀ (λI_n + DDᵀ)⁻¹
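A quick numerical sanity check of the identity (random matrix and an arbitrary λ > 0; purely illustrative, not the requested SVD-based proof):

import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 6, 4, 0.7
D = rng.standard_normal((n, d))
lhs = np.linalg.inv(lam * np.eye(d) + D.T @ D) @ D.T
rhs = D.T @ np.linalg.inv(lam * np.eye(n) + D @ D.T)
print(np.allclose(lhs, rhs))   # True, up to floating-point error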
How would your architecture for the previous question change if you were given a training database in which the mutation positions in each sequence were tagged, and the test database was untagged?
Suppose that you have a large database of biological strings containing sequences of nucleobases drawn from {A, C, T, G}. Some of these strings contain unusual mutations representing changes in the nucleobases. Propose an unsupervised method (i.e., neural architecture) using RNNs in order to detect
Propose a neural architecture to perform binary classification of a sequence.
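One plausible answer, sketched in PyTorch under the assumption of fixed-length integer token sequences: a recurrent layer whose final hidden state feeds a single sigmoid unit, trained with binary cross-entropy. All dimensions and layer choices here are illustrative.

import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    def __init__(self, vocab_size=4, embed_dim=16, hidden_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, x):                 # x: (batch, seq_len) token indices
        _, h = self.rnn(self.embed(x))    # h: (1, batch, hidden_dim)
        return torch.sigmoid(self.out(h[-1]))  # (batch, 1) class probability

# Training would minimize nn.BCELoss() between these outputs and 0/1 labels.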
Download the character-level RNN in [222], and train it on the “tiny Shakespeare” data set available at the same location. Create outputs of the language model after training for (i) 5 epochs, (ii) 50 epochs, and (iii) 500 epochs. What significant differences do you see between the three
Perform a 4 × 4 pooling at stride 1 of the input volume in the upper-left corner of Figure 8.4.
Compute the convolution of the input volume in the upper-left corner of Figure 8.2 with the horizontal edge detection filter of Figure 8.1(b). Use a stride of 1 without padding.
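The figure contents are not reproduced here, so as a hedged illustration, the two operations in the preceding questions can be applied to any small array with the sketch below (stride 1, no padding; the 7 × 7 input and the horizontal edge filter are hypothetical stand-ins for Figures 8.2/8.4 and 8.1(b)).

import numpy as np

def conv2d(x, f):
    # "Valid" convolution as used in CNNs (i.e., cross-correlation).
    H, W = x.shape
    FH, FW = f.shape
    out = np.empty((H - FH + 1, W - FW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + FH, j:j + FW] * f)
    return out

def max_pool(x, size=4, stride=1):
    H, W = x.shape
    out = np.empty(((H - size) // stride + 1, (W - size) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

x = np.arange(49, dtype=float).reshape(7, 7)                    # hypothetical input
f = np.array([[1, 1, 1], [0, 0, 0], [-1, -1, -1]], dtype=float) # horizontal edges
print(conv2d(x, f).shape, max_pool(x).shape)                    # (5, 5) (4, 4)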
Download an implementation of the AlexNet architecture from a neural network library of your choice. Train the network on subsets of varying size from the ImageNet data, and plot the top-5 error with data size.
Work out the number of parameters in each spatial layer for column D of Table 8.1.
Work out the sizes of the spatial convolution layers for each of the columns of Table 8.1. In each case, we start with an input image volume of 224 × 224 × 3.
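Both of these questions reduce to standard bookkeeping formulas. As a hedged worked example (not an actual Table 8.1 entry): the spatial output size is W_out = (W_in − F + 2P)/S + 1, and a convolution layer has (F · F · C_in + 1) · C_out parameters, where the +1 counts the bias. A 3 × 3 convolution from 64 to 128 channels therefore has (3 · 3 · 64 + 1) · 128 = 73,856 parameters.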
Justify your answer in each case.
Consider an activation volume of size 13×13×64 and a filter of size 3×3×64. Discuss whether it is possible to perform convolutions with strides 2, 3, 4, and
For a one-dimensional time series of length L and a filter of size F, what is the length of the output? How much padding would you need to keep the output size to a constant value?
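A quick NumPy check of the general rule, using illustrative values:

import numpy as np

L, F = 10, 3
x = np.arange(L, dtype=float)
f = np.ones(F)
print(len(np.convolve(x, f, mode="valid")))  # L - F + 1 = 8 without padding
print(len(np.convolve(x, f, mode="same")))   # F - 1 total zeros keep length L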
Consider a 1-dimensional time-series with values 2, 1, 3, 4,
Perform a convolution with a 1-dimensional filter 1, 0, 1 and zero padding.
Multinomial logistic regression with neural networks: Propose a neural network architecture using the softmax activation function and an appropriate loss function that can perform multinomial logistic regression. You may refer to Chapter 6 for details of multinomial logistic regression.
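A minimal NumPy sketch of such an architecture: one linear layer with softmax outputs trained on the cross-entropy loss, which is exactly multinomial logistic regression. Shapes, learning rate, and epoch count are illustrative.

import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_softmax_regression(X, y, k, lr=0.1, epochs=200):
    # X: (n, d) features; y: (n,) integer labels in {0, ..., k-1}.
    n, d = X.shape
    W = np.zeros((d, k))
    Y = np.eye(k)[y]                       # one-hot targets
    for _ in range(epochs):
        P = softmax(X @ W)                 # predicted class probabilities
        W -= lr * X.T @ (P - Y) / n        # gradient of the cross-entropy in W
    return W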
Convert the weighted computational graph of Figure 7.2 into an unweighted graph by defining additional nodes containing w1 . . . w5 along with appropriately defined hidden nodes.
Consider the computational graph shown in Figure 7.19(b), in which the local derivative ∂y(j)/∂y(i) is shown for each edge (i, j), where y(k) denotes the activation of node k. The output o is 0.1, and the loss L is given by −log(o). Compute the value of ∂L/∂x_i for each input x_i using both the
Consider the computational graph shown in Figure 7.19(a), in which the local derivative ∂y(j)/∂y(i) is shown for each edge (i, j), where y(k) denotes the activation of node k. The output o is 0.1, and the loss L is given by −log(o). Compute the value of ∂L/∂x_i for each input x_i using both the
Consider the computational graph of Figure 7.10. The upper node in each layer computes sin(x + y) and the lower node in each layer computes cos(x + y) with respect to its two inputs. For the first hidden layer, there is only a single input x, and therefore the values sin(x) and cos(x) are computed.
Consider the computational graph of Figure 7.10. For a particular numerical input x = a, you find the unusual situation that the value ∂y(j)/∂y(i) is 0.3 for each and every edge (i, j) in the network. Compute the numerical value of the partial derivative of the output with respect to the input x
Use the pathwise aggregation lemma to compute the derivative of y(10) with respect to each of y(1), y(2), and y(3) as an algebraic expression (cf. Figure 7.11). You should get the same derivative as obtained using the backpropagation algorithm in the text of the chapter.
All-pairs node-to-node derivatives: Let y(i) be the variable in node i in a directed acyclic computational graph containing n nodes and m edges. Consider the case where one wants to compute S(i, j) = ∂y(j)/∂y(i) for all pairs of nodes in a computational graph, so that at least one directed path
Forward Mode Differentiation: The backpropagation algorithm needs to compute node-to-node derivatives of output nodes with respect to all other nodes, and therefore computing gradients in the backwards direction makes sense. Consequently, the pseudocode on page 228 propagates gradients in the
Consider a neural network in which a vectored node v feeds into two distinct vectored nodes h1 and h2 computing different functions. The functions computed at the nodes are h1 = ReLU(W1v) and h2 = sigmoid(W2v). We do not know anything about the values of the variables in other parts of the network,
For Exercise 11, show the following loss-to-weight derivatives:
∂L/∂U = Σ_{p=1}^{t} (∂L(o_p)/∂o_p) h_pᵀ
∂L/∂W = Σ_{p=2}^{t} Δ_{p−1} (∂L/∂h_p) h_{p−1}ᵀ
∂L/∂V = Σ_{p=1}^{t} Δ_p (∂L/∂h_p) x_pᵀ
What are the sizes and ranks of these matrices?
Suppose that the output structure of the neural network in Exercise 9 is changed so that there are k-dimensional outputs o_1 . . . o_t in each layer, and the overall loss is L = Σ_{i=1}^{t} L(o_i). The output recurrence is o_p = U h_p. All other recurrences remain the same. Show that the backpropagation
Show that if we use the loss function L(o) in Exercise 9, then the loss-to-node gradient can be computed for the final layer h_t as follows:
∂L(o)/∂h_t = Uᵀ ∂L(o)/∂o
The updates in earlier layers remain similar to Exercise 9, except that each o is replaced by L(o). What is the size of each matrix
Consider a neural network that has hidden layers h_1 . . . h_t, inputs x_1 . . . x_t into each layer, and outputs o from the final layer h_t. The recurrence equation for the pth layer is as follows:
o = U h_t
h_p = tanh(W h_{p−1} + V x_p)  ∀p ∈ {1 . . . t}
The vector output o has dimensionality k, each h_p
Consider the neural architecture with connections between alternate layers, as shown in Figure 7.15(b). Suppose that the recurrence equations of this neural network are as follows:
h_1 = ReLU(W_1 x)
h_2 = ReLU(W_2 x + W_3 h_1)
y = W_4 h_2
Here, W_1, W_2, W_3, and W_4 are matrices of appropriate size. Use the
Discuss why the dynamic programming algorithm for computing the gradients will not work in the case where the computational graph contains cycles.
Consider a computational graph in which you are told that the variables on the edges satisfy k linear equality constraints. Discuss how you would train the weights of such a graph. How would your answer change if the variables satisfied box constraints? [The reader is advised to refer to the
Suppose that you have a computational graph with the constraint that specific sets of weights are always constrained to be at the same value. Discuss how you can compute the derivative of the loss function with respect to these weights. [Note that this trick is used frequently in the neural network
Let f(x) be defined as follows:
f(x) = sin(x) + cos(x)
Consider the function f(f(f(f(x)))). Write this function in closed form to obtain an appreciation of the awkwardly long function. Evaluate the derivative of this function at x = π/3 radians by using a computational graph abstraction.
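A small sketch of the computational-graph idea: instead of expanding the closed form, store the forward values and multiply the local derivatives f′ along the chain.

import math

def f(x):
    return math.sin(x) + math.cos(x)

def fp(x):
    return math.cos(x) - math.sin(x)   # f'(x)

x = math.pi / 3
forward = [x]
for _ in range(4):                     # forward pass through four f-nodes
    forward.append(f(forward[-1]))

grad = 1.0
for v in reversed(forward[:-1]):       # backward pass: chain-rule product
    grad *= fp(v)
print(grad)                            # derivative of f(f(f(f(x)))) at pi/3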
The book discusses a vector-centric view of backpropagation in which backpropagation in linear layers can be implemented with matrix-to-vector multiplications. Discuss how you can deal with batches of training instances at a time (i.e., mini-batch stochastic gradient descent) by using
Repeat Exercise 1 with the changed setting that you want to simulate Widrow-Hoff learning (least-squares classification) with the same computational graph. What will be the loss function associated with the single output node?
The discussion on page 215 proposes a loss function for the L1-SVM in the context of a computational graph. How would you change this loss function, so that the same computational graph results in an L2-SVM?
Discuss how you can generate rules from a decision tree. Comment on the relationship between decision trees and nearest neighbor classifiers.
Hinge-loss without margin: Suppose that we modified the hinge-loss on page 179 by removing the constant value within the maximization function as follows:
J = Σ_{i=1}^{n} max{0, −y_i[W · X_iᵀ]} + (λ/2)‖W‖²
This loss function is referred to as the perceptron criterion. Derive the stochastic gradient
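A hedged sketch of where the derivation ends up: the point-wise subgradient is −y_i X_i whenever y_i(W · X_i) ≤ 0 and zero otherwise, plus the regularization term. The learning rate and the treatment of the boundary case (updating on ties, as in the classical perceptron) are assumptions.

import numpy as np

def perceptron_criterion_sgd(X, y, lam=0.01, lr=0.1, epochs=20, seed=0):
    # X: (n, d) rows are training points; y: (n,) labels in {-1, +1}.
    n, d = X.shape
    W = np.zeros(d)
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(n):
            W *= (1 - lr * lam)            # shrinkage from (lambda/2)||W||^2
            if y[i] * (W @ X[i]) <= 0:     # hinge term is active
                W += lr * y[i] * X[i]      # subgradient step
    return W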
The L1-loss regression model uses a modified loss function in which the L1-norm of the error is used to create the objective function (rather than the squared norm). Derive the stochastic gradient-descent updates of L1-loss regression.
The goal of this exercise is to show that the stochastic gradient-descent updates for various machine learning models are closely related. The updates for the least-squares classification, SVM, and logistic-regression models can be expressed in a unified way in terms of a model-specific mistake
The text of the chapter introduces the loss function of the L2-loss SVM, but it does not discuss the update used by stochastic gradient descent. Derive the stochastic gradient descent update for the L2-loss SVM.
Discuss why any linear classifier is a special case of a rule-based classifier.
Let P(x, y) denote the fact that x is the parent of y. Use the equality operator to write a first-order expression asserting that everyone has exactly two parents.
Suppose that the following statements are true:
(∃x)(A(x) ∧ B(x)) ⇒ (∀x)(C(x) ∧ D(x))
(∃x)A(x) ⇒ (∀x)(B(x) ∧ C(x))
A(c)
Show that the following statement is true:
(∀x)(B(x) ∧ D(x))
Suppose that the following statements are true:
(∃x)(A(x) ⇒ (∀y)(F(y) ⇒ C(y)))
(∃x)(B(x) ⇒ (∀y)(E(y) ⇒ ¬F(y)))
¬(∃x)(C(x) ∧ ¬E(x))
Then show using the laws of first-order logic that the following statement is true:
(∃x)(A(x) ∧ B(x)) ⇒ ¬(∃x)F(x)
Convert the following set of sentences into symbolic form and then construct a proof for the final sentence based on the first two sentences:
Anyone who studies both artificial intelligence and botany is cool. There is at least one person who studies botany but is not cool. Therefore, there must be
Convert the following set of sentences into symbolic form and construct a proof for the final sentence based on the first two sentences:
Wherever there are deer, there are also lions. There is at least one deer in Serengeti. Therefore, there must also be at least one lion in Serengeti.
Use D(x) and
Suppose that the following statements are true:
(∀x)(A(x) ∧ B(x) ⇒ C(x))
(∃x)(B(x) ∧ ¬C(x))
Show using the laws of first-order logic that the following statement is true:
(∃x)¬A(x)
Consider the following first-order definite rules, which are all universally quantified in terms of the variable x:
A(x) ⇒ B(x)
B(x) ⇒ C(x)
B(x) ∧ C(x) ⇒ D(x)
Suppose that the statement A(John) is true. Use forward chaining to show that D(John) is also true.
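A tiny propositionalized sketch of the forward-chaining run (grounding x to John; the data structures are illustrative):

rules = [
    ({"A"}, "B"),          # A(John) => B(John)
    ({"B"}, "C"),          # B(John) => C(John)
    ({"B", "C"}, "D"),     # B(John) and C(John) => D(John)
]
facts = {"A"}              # A(John) is given

changed = True
while changed:             # fire rules until a fixed point is reached
    changed = False
    for premises, conclusion in rules:
        if premises <= facts and conclusion not in facts:
            facts.add(conclusion)
            changed = True
print("D" in facts)        # True: D(John) is entailed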
Find a way to express the statement ∀x [¬A(x) ⇒ ¬B(x)] in first-order definite form.
Consider the following expression:
(A(x) ∨ B(x)) ∧ (A(x) ∨ ¬B(John)) ∧ ¬A(Tom)
This expression is in conjunctive normal form, and the variable x is universally quantified. Use unification to show that the expression always evaluates to False.
Let E(x, y) denote the statement that x eats y. Which of the following is a way of stating that Tom eats fish or beef? (a) E(Tom, Fish ∨ Beef), (b) E(Tom, Fish) ∨ E(Tom, Beef).
Consider the following statement in first-order logic:
∀x P(x) ∨ ∀y Q(y)
Suppose that the domain of discourse for each of the atomic formulas P(·) and Q(·) is {a1, a2, a3}. Write the statement using propositional logic.
Consider the following knowledge base:
∀x [A(x) ∧ ¬A(John) ⇒ B(x)]
Is this statement a tautology? Is the following statement a tautology?
∃x [A(x) ∧ ¬A(John) ⇒ B(x)]
Consider the following knowledge base:
∀x [A(x) ⇒ B(x)]
∀x [B(x) ∧ C(x) ⇒ D(x)]
A(John)
Does this knowledge base entail D(John)?
Argue using the connections of first-order logic to propositional logic as to why the following statements are tautologies:
(∀x)(A(x) ∧ B(x)) ≡ [(∀y)A(y)] ∧ [(∀z)B(z)]
(∃x)(A(x) ∧ B(x)) ⇒ [(∃y)A(y)] ∧ [(∃z)B(z)]
Argue by counterexample why the converse of the unidirectional
Consider a knowledge base containing two sentences. The first sentence is ∃x [A(x) ⇒ B(x)], and the second is A(John). Does this knowledge base entail B(John)?
Is the statement ∀x P(x) ⇒ ∃y P(y) a tautology?
Consider the following two statements:
If Alice likes daisies, she also likes roses. Alice does not like daisies.
Do the above sentences entail the following?
Alice does not like roses.
Suppose that you had an algorithm to determine in polynomial time whether an expression is a tautology. Use this algorithm to propose a polynomial-time algorithm for deciding whether an expression is satisfiable. Given that satisfiability is NP-complete, what can you infer about the computational complexity of
Create truth tables for the XOR, NAND, and NOR logical operators.
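For reference, the standard tables (1 = true, 0 = false):

p q | p XOR q | p NAND q | p NOR q
0 0 |    0    |    1     |    1
0 1 |    1    |    1     |    0
1 0 |    1    |    1     |    0
1 1 |    0    |    0     |    0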
For the knowledge base and goal clause of Exercise 11, simulate a forward chaining procedure to show that the knowledge base entails the goal q.
Consider a knowledge base containing the following rules and positive facts:
a ∧ c ∧ d ⇒ q
e ⇒ a
e ⇒ c
f ⇒ e
d, f
Simulate a backward chaining procedure on this toy knowledge base to show that it entails the goal q.
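A compact sketch of backward chaining on this knowledge base (recursive goal decomposition with a cycle check; the data structures are illustrative):

rules = {                  # conclusion -> list of premise sets
    "q": [{"a", "c", "d"}],
    "a": [{"e"}],
    "c": [{"e"}],
    "e": [{"f"}],
}
facts = {"d", "f"}

def prove(goal, seen=frozenset()):
    # A goal holds if it is a known fact, or if some rule concluding it
    # has all premises recursively provable.
    if goal in facts:
        return True
    if goal in seen:
        return False
    return any(all(prove(p, seen | {goal}) for p in body)
               for body in rules.get(goal, []))

print(prove("q"))          # True: the knowledge base entails the goal q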