Question: 2. Consider the gridworld in Fig. Q2, where only two actions are possible in each state. The possible actions are: A = {Right, Exit}, B = {Left, Right} and C = {Left, Exit}. Rewards (R(A, Exit) = +2 and R(C, Exit) = +8) are available when the agent takes the Exit action in states A and C. All actions are 100% successful. In this scenario, the discount is γ = 1 and the learning rate is α = 0.5.

[Fig. Q2: a three-cell gridworld with states A, B and C in a row; exiting from A yields R(A, Exit) = +2 and exiting from C yields R(C, Exit) = +8.]

Assume that the initial estimate of the value function V^π for each state is zero, as follows:

V^π(A)    V^π(B)    V^π(C)
  0         0         0

Unfortunately, we do not know the details of the MDP, so we use reinforcement learning to compute the various values. Here are the training episodes:

Episode 1           Episode 2           Episode 3           Episode 4
A, exit, x, +2      A, east, B, -1      A, east, B, -1      A, east, B, -1
                    B, east, C, -1      B, west, A, -1      B, west, A, -1
                    C, exit, x, +8      A, east, B, -1      A, exit, x, +2
                                        B, east, C, -1
                                        C, exit, x, +8

Use temporal difference (TD) learning,

    V^π(s) ← (1 - α) V^π(s) + α [R(s, π(s), s') + γ V^π(s')],

to find the values of each state after the 4 episodes of learning and write your answer in the table below.

V^π(A)    V^π(B)    V^π(C)
______    ______    ______
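The computation the question asks for is a repeated application of the TD update above, sweeping through the four episodes in order. The sketch below is a minimal Python rendering of that procedure; the episode transcripts and the constants α = 0.5, γ = 1 are taken directly from the problem statement, while treating the post-exit terminal state "x" as having value 0 (and the column-wise reading of the episode table) are assumptions.

```python
# Sketch of TD(0) value estimation for the Fig. Q2 gridworld, assuming the
# episode transcripts as laid out above. "x" is the terminal state reached
# by the Exit action; its value is assumed to be 0.

ALPHA = 0.5   # learning rate given in the question
GAMMA = 1.0   # discount factor given in the question

# Each transition is (state, action, next_state, reward).
episodes = [
    [("A", "exit", "x", +2)],
    [("A", "east", "B", -1), ("B", "east", "C", -1), ("C", "exit", "x", +8)],
    [("A", "east", "B", -1), ("B", "west", "A", -1), ("A", "east", "B", -1),
     ("B", "east", "C", -1), ("C", "exit", "x", +8)],
    [("A", "east", "B", -1), ("B", "west", "A", -1), ("A", "exit", "x", +2)],
]

# Initial value estimates are all zero, as stated in the question.
V = {"A": 0.0, "B": 0.0, "C": 0.0, "x": 0.0}

for i, episode in enumerate(episodes, start=1):
    for s, a, s_next, r in episode:
        # TD(0) update: V(s) <- (1 - alpha) V(s) + alpha (r + gamma V(s'))
        V[s] = (1 - ALPHA) * V[s] + ALPHA * (r + GAMMA * V[s_next])
    print(f"After episode {i}: "
          f"V(A)={V['A']:.4f}, V(B)={V['B']:.4f}, V(C)={V['C']:.4f}")
```

As a check on the first step, Episode 1 gives V^π(A) ← (1 - 0.5)·0 + 0.5·(2 + 1·0) = 1. Under this reading of the episode table, running the sketch through all four episodes ends with V^π(A) = 0.625, V^π(B) = -0.40625 and V^π(C) = 6.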
