Question:

Consider the 3-state problem below. We know that there are two possible actions, LEFT and RIGHT, from each state, but we do not know anything about the transition model between the state-action pairs. Assume a learning rate of 1/4 and a discount rate of ____, and use the Berkeley formulation of the state value calculation, where the reward is given on entry into a state.

[State diagram: S1  S2  S3]

I observed the following episodes, where Rk(Si, Action, Sj) reports the reward received for the kth iteration after starting in Si, taking the specified Action, and ending up in Sj:

R1(S1, LEFT, S3) = 8;    R2(S3, LEFT, S2) = 32;    R3(S2, LEFT, S1) = 16;
R4(S1, LEFT, S3) = 8;    R5(S3, LEFT, S1) = 16;    R6(S1, LEFT, S2) = 32;
R7(S2, RIGHT, S2) = 32;  R8(S2, RIGHT, S3) = 8;    R9(S3, RIGHT, S1) = 16;
R10(S1, RIGHT, S3) = 8;  R11(S3, RIGHT, S1) = 16;  R12(S1, RIGHT, S2) = 32

I used the observations to learn the following estimates for the Q-values:

Q(S1, LEFT) = 11.906    Q(S1, RIGHT) = 11.705
Q(S2, LEFT) = 4.25      Q(S2, RIGHT) = 9.719
Q(S3, LEFT) = 10.563    Q(S3, RIGHT) = 9.604

Starting from these Q-values, you observe the following two episodes:

R13(S2, RIGHT, S1) = 16
R14(S1, RIGHT, S3) = 8

3 questions (show all your work neatly):

a) What are the best estimates for the Q-values after these two additional observations?
b) What is the appropriate policy for all states if you wish to exploit this learned information?
c) What is the estimated transition model based on the above information? (sketch above)

Clearly indicate your answers to each question below, and sketch the transition model above with estimated probabilities shown.
Step by Step Solution
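Part (a) rests on the standard tabular Q-learning update (in the Berkeley reward-on-entry convention, the reward R(s, a, s') sits inside the sample), which for learning rate alpha and discount gamma reads:

Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left[ R(s,a,s') + \gamma \max_{a'} Q(s',a') \right]

Here alpha = 1/4 as given; gamma is left symbolic because the discount rate is blank in the problem statement above.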
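Below is a minimal Python sketch of the three computations, assuming the update rule above. ALPHA is the given 1/4; GAMMA is a placeholder assumption, since the discount rate is missing from the problem statement, so the printed Q-values are only final once the actual discount is substituted. The transitions list is the 14 observed (state, action, next-state) triples copied from the question.

from collections import defaultdict

ALPHA = 0.25        # learning rate 1/4 (given in the problem)
GAMMA = 0.5         # ASSUMED placeholder: the discount rate is blank above

# Q-values learned from the first 12 observations (given)
Q = {
    ('S1', 'LEFT'): 11.906, ('S1', 'RIGHT'): 11.705,
    ('S2', 'LEFT'): 4.25,   ('S2', 'RIGHT'): 9.719,
    ('S3', 'LEFT'): 10.563, ('S3', 'RIGHT'): 9.604,
}

def q_update(s, a, r, s_next):
    # One sample update: Q(s,a) <- (1-alpha)Q(s,a) + alpha[r + gamma*max_a' Q(s',a')]
    sample = r + GAMMA * max(Q[(s_next, act)] for act in ('LEFT', 'RIGHT'))
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * sample

# Part (a): apply the two new observations in the order they were seen
q_update('S2', 'RIGHT', 16, 'S1')    # R13(S2, RIGHT, S1) = 16
q_update('S1', 'RIGHT', 8, 'S3')     # R14(S1, RIGHT, S3) = 8
print('a) updated Q-values:', Q)

# Part (b): exploiting (greedy) policy = argmax over actions of Q(s, a)
policy = {s: max(('LEFT', 'RIGHT'), key=lambda a: Q[(s, a)])
          for s in ('S1', 'S2', 'S3')}
print('b) greedy policy:', policy)

# Part (c): maximum-likelihood transition model from the 14 observed triples
transitions = [
    ('S1', 'LEFT', 'S3'),  ('S3', 'LEFT', 'S2'),  ('S2', 'LEFT', 'S1'),   # R1-R3
    ('S1', 'LEFT', 'S3'),  ('S3', 'LEFT', 'S1'),  ('S1', 'LEFT', 'S2'),   # R4-R6
    ('S2', 'RIGHT', 'S2'), ('S2', 'RIGHT', 'S3'), ('S3', 'RIGHT', 'S1'),  # R7-R9
    ('S1', 'RIGHT', 'S3'), ('S3', 'RIGHT', 'S1'), ('S1', 'RIGHT', 'S2'),  # R10-R12
    ('S2', 'RIGHT', 'S1'), ('S1', 'RIGHT', 'S3'),                         # R13-R14
]
counts = defaultdict(lambda: defaultdict(int))
for s, a, s_next in transitions:
    counts[(s, a)][s_next] += 1
model = {sa: {s2: n / sum(dests.values()) for s2, n in dests.items()}
         for sa, dests in counts.items()}
print('c) estimated transition model:', model)

Applying the two updates in order (R13, then R14) and taking the per-state argmax of Q(s, a) gives the exploiting policy for part (b); the part (c) model is simply the observed frequency of each successor for every (state, action) pair, e.g. from S1 under LEFT the observed successors are S3 twice and S2 once, giving estimates 2/3 and 1/3.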
