A learning agent interacts with an MDP (S, A, T, R, 7), where S = A...
Fantastic news! We've Found the answer you've been seeking!
Question:
Transcribed Image Text:
A learning agent interacts with an MDP (S, A, T, R, 7), where S = A = {a1, a2, a3). No discounting is used (y = 1). The agent begins with the Q-table given below as initialisation Qº. S $1 82 83 Qº(s, a) a1 220 a a2 -3 2 -1 a3 1 2 1 {81, 82, 83) and The agent uses e-greedy exploration with € = 0.15, and a learning rate a = 0.1 (both are con- stants, not annealed over time). Starting from state s2, suppose that the agent's first transition is (82, 01, 2, 83) (the next state is s3 and the reward 2). From $3, the agent decides to take action a2. Thus, the agent's trajectory is s2, a1, 2, 83, a2, .... What is Q¹ the Q-table after making the first learning update? Give your answer for each of Q-learning, Sarsa, and Expected Sarsa being used for making the update. In each case provide the complete 3 x 3 table for Q¹. In the absence of ties, note that a 0.15-greedy policy will pick the "argmax" action with probability 0.9, and each of the other two actions with probability 0.05. A learning agent interacts with an MDP (S, A, T, R, 7), where S = A = {a1, a2, a3). No discounting is used (y = 1). The agent begins with the Q-table given below as initialisation Qº. S $1 82 83 Qº(s, a) a1 220 a a2 -3 2 -1 a3 1 2 1 {81, 82, 83) and The agent uses e-greedy exploration with € = 0.15, and a learning rate a = 0.1 (both are con- stants, not annealed over time). Starting from state s2, suppose that the agent's first transition is (82, 01, 2, 83) (the next state is s3 and the reward 2). From $3, the agent decides to take action a2. Thus, the agent's trajectory is s2, a1, 2, 83, a2, .... What is Q¹ the Q-table after making the first learning update? Give your answer for each of Q-learning, Sarsa, and Expected Sarsa being used for making the update. In each case provide the complete 3 x 3 table for Q¹. In the absence of ties, note that a 0.15-greedy policy will pick the "argmax" action with probability 0.9, and each of the other two actions with probability 0.05.
Expert Answer:
Related Book For
Posted Date:
Students also viewed these computer engineering questions
-
Suppose that A1, A2 , A3 , and B are events where A1 , A2 , and A3 are mutually exclusive and P (A1) =.2 P (A2) =.5 P (A3) = .3 P (B A1) = .02 P (B A2) =.05 P( B A3 )=.04 Use this information to find...
-
The nuclei involved in the nuclear reaction A1 + A2 A3 + A4 have the binding energies E1, E2, E3, and E4. Find the energy of this reaction.
-
Suppose at time t = 0, we are given four zero-coupon bond prices {B1, B2, B3, B4} that mature at times t = 1, 2, 3, 4. This forms the term structure of interest rates. We also have one-period forward...
-
Summarize the extract given below? When Paul Farmer graduated from Duke University at 22, he was unsure whether he wanted to be an anthropologist or a doctor. So he went to Haiti. As a student, Paul...
-
Berg Company adopted a share-option plan on November 30, 2009, that provided that 70,000 shares of $5 par value ordinary shares be designated as available for the granting of options to officers of...
-
Matrix A represents a digital photograph. Find a matrix B that represents the negative image of 1. 030 131 2 32
-
A window manufacturer has a commitment to lean production. What is not a benefit from adopting a lean production philosophy? a. More space available for production b. Ability to continue production...
-
Compare and contrast the following techniques based on costs and benefits: Test data method Base case system evaluation Tracing Integrated test facility Parallel simulation
-
Stewiacke Ltd. is currently considering a project with a four-year life that it believes may return the company to profitability. Stewiacke recently did a market survey at a cost of $100,000. The...
-
Ellen considered saving $10,000 per year for her retirement. Although $10,000 is the most she can save in the first year, she expects her salary to increase each year so that she will be able to...
-
The following data were collected during a study of consumer buying patterns: Observation x y Observation x y 1 11 77 8 23 78 2 30 77 9 10 69 3 44 82 10 19 70 4 34 84 11 23 87 5 47 92 12 19 90 6 42...
-
1. Identify and fix the errors in this code: public class Test{ } public static method 1 (int n,m){ n += m; method2 (3.4); } public static int method2 (int n){ if(n > 0) return 1; else if (n == 0)...
-
For a company like Peloton, what would be its system and leadership using the formulation and implementation model? Can you explain system and leadership in depth and with examples?
-
Consider a thin bar of length 16 with heat distribution T(z, t), where (a) Suppose 7 satisfies homogeneous Bs and the IC i. What are un and An (n=1,2,...)? WT An T(z,t) (n*Pi/16 Find T'(x, t) by...
-
i 01234S 5 a) Draw a flowchart for the following pseudocode: Do i=i+1 IF Z> 50 Exit X=X+5 IF X > 5 Then Y=X ELSE Y-0 ENDIF Z=X+Y END DO b) From the above pseudocode complete the table below X 0 Y 0 Z...
-
The figure shows a timber panorama terrace. Each internal plank has a width of b and a length of t. The characteristic self-weight is g0p,k (on unit area) for floor planking, g0b,k (on unit length)...
-
Fabrick Company's quality cost report is to be based on the following data: Lost sales due to poor quality $ 15,200 Quality data gathering, analysis, and reporting$ 64,700 Net cost of spoilage $...
-
What are the main distinctions between the different schools of legal interpretation?
-
Determine the two 2-values that divide the area under the curve into a middle 0.95 area and two outside 0.025 areas for a 2-curve with a. df = 5. b. df = 26. Find the required 2-values. Illustrate...
-
As reported by the Associated Press, a veteran lobsterman recently hauled up a yellow lobster less than a quarter mile south of Prince Point in Harpswell Cove, Maine. Yellow lobsters are considerably...
-
What is another name for the empirical rule? Why is that name appropriate?
-
Suppose that a companys ROE is 25 percent and its retention rate is 60 percent. According to the expression for the sustainable growth rate, the dividends should grow at g = b ROE = 0.60 25 percent...
-
Baggai Enterprises has an ROA of 10 percent, retains 30 percent of earnings, and has an equity multiplier of 1.25. Mondale Enterprises also has an ROA of 10 percent, but it retains two-thirds of...
-
International Business Machines (NYSE: IBM), which currently pays a dividend of \($3.40\) per share, has been the subject of two other examples in this reading. In one example, an analyst estimated...
Study smarter with the SolutionInn App