Question:
![Figure 3.2: gridworld example (left) and state-value function for the equiprobable random policy (right)](https://dsd5zvtm8ll6.cloudfront.net/si.experts.images/questions/2024/09/66f4f2ea1d235_00166f4f2e997e2b.jpg)
4. [50pts] [Programming problem] The following gridworld problem is a simple exemplar MDP from the book *Reinforcement Learning: An Introduction*. Implement this gridworld problem, and implement the iterative policy evaluation algorithm to estimate the state values of the equiprobable policy (all four actions have an equal chance of being taken in any state).

Hint: This is an MDP, and two types of problems can be associated with an MDP: prediction (given a policy, predict its state or action values) and control (find the action policy that maximizes the state or action values). This homework problem asks you to solve the prediction problem for a given policy, namely the equiprobable policy. The final state values after the iterations should look similar to the table on the right-hand side of Fig. 3.2 below.

Example 3.5: Gridworld. Figure 3.2 (left) shows a rectangular gridworld representation of a simple finite MDP. The cells of the grid correspond to the states of the environment. At each cell, four actions are possible: north, south, east, and west, which deterministically cause the agent to move one cell in the respective direction on the grid. Actions that would take the agent off the grid leave its location unchanged, but also result in a reward of −1. Other actions result in a reward of 0, except those that move the agent out of the special states A and B. From state A, all four actions yield a reward of +10 and take the agent to A′. From state B, all actions yield a reward of +5 and take the agent to B′.

Figure 3.2: Gridworld example: exceptional reward dynamics (left) and state-value function for the equiprobable random policy (right).

Iterative Policy Evaluation, for estimating $V \approx v_\pi$:

Input: $\pi$, the policy to be evaluated.
Algorithm parameter: a small threshold $\theta > 0$ determining the accuracy of estimation.
Initialize $V(s)$ arbitrarily for $s \in \mathcal{S}$, and $V(\mathit{terminal})$ to $0$.

Loop:
- $\Delta \leftarrow 0$
- Loop for each $s \in \mathcal{S}$:
  - $v \leftarrow V(s)$
  - $V(s) \leftarrow \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma V(s')\big]$
  - $\Delta \leftarrow \max(\Delta, |v - V(s)|)$

until $\Delta < \theta$.
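Below is a minimal Python sketch of one possible implementation, not an official solution. The 5×5 grid, the special-state coordinates (A at (0, 1) jumping to A′ at (4, 1); B at (0, 3) jumping to B′ at (2, 3)), and the discount rate γ = 0.9 follow Example 3.5 of the book; the helper name `step` and the threshold value `THETA` are our own choices.

```python
import numpy as np

# Gridworld of Example 3.5 (Sutton & Barto). Row 0 is the top row.
N = 5
GAMMA = 0.9           # discount rate; Example 3.5 uses 0.9
THETA = 1e-4          # convergence threshold (our choice)
A, A_PRIME = (0, 1), (4, 1)   # special state A jumps to A' with reward +10
B, B_PRIME = (0, 3), (2, 3)   # special state B jumps to B' with reward +5
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # north, south, west, east


def step(state, action):
    """Deterministic dynamics: return (next_state, reward)."""
    if state == A:
        return A_PRIME, 10.0
    if state == B:
        return B_PRIME, 5.0
    r, c = state
    nr, nc = r + action[0], c + action[1]
    if 0 <= nr < N and 0 <= nc < N:
        return (nr, nc), 0.0
    return state, -1.0    # off-grid move: stay put, reward -1


# Iterative policy evaluation for the equiprobable random policy
# (each action has probability 1/4; updates are done in place,
# as in the boxed algorithm above).
V = np.zeros((N, N))
while True:
    delta = 0.0
    for r in range(N):
        for c in range(N):
            v = V[r, c]
            V[r, c] = sum(
                0.25 * (reward + GAMMA * V[next_state])
                for next_state, reward in (step((r, c), a) for a in ACTIONS)
            )
            delta = max(delta, abs(v - V[r, c]))
    if delta < THETA:
        break

print(np.round(V, 1))
```

The printed array should be close to the table on the right of Fig. 3.2 (for example, roughly 8.8 at A and 5.3 at B). Because the sweep updates $V$ in place, each state's update immediately uses the freshest values of its neighbors, which is exactly how the boxed algorithm is written.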
