Question: The agent is in a 2×4 gridworld as shown in the figure. We start from square 1 and finish in square 8. When square 8 is reached, we receive a reward of +10 and the game ends. For anything else, we receive a constant reward of -1 (you can think of this as a time penalty).
The actions in this MDP are: up, down, left, and right. The agent cannot take actions that would take it off the board. In the table below, we provide initial non-zero estimates of the Q-values (Q-values for invalid actions are left blank):
Your friend Adam guesses that the actions in this MDP are fully deterministic (e.g., taking down from 2 will land you in 6 with probability 1, and anywhere else with probability 0). Since we have full knowledge of T and R, we can use the Bellman equation to improve (i.e., further update) the initial Q estimates. Adam tells you to use the following update rule for Q-values, where he assumes that your policy is greedy and thus takes $\max_a Q(s,a)$. The update rule he prescribes is:
$$Q_{k+1}(s,a) = \sum_{s'} T(s,a,s')\left[R(s,a,s') + \max_{a'} Q_k(s',a')\right]$$
a. Perform one update of Q(·, left).
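To make the rule concrete, here is a minimal Python sketch of one synchronous sweep of Adam's update over this grid. The layout is assumed to be 1-4 on the top row and 5-8 on the bottom (inferred from "down from 2 lands in 6"), the all-zero starting Q table is a hypothetical placeholder since the problem's actual initial estimates come from the table above, and no discount factor appears because Adam's rule omits one.

```python
COLS = 4
STATES = range(1, 9)
TERMINAL = 8
ACTIONS = ("up", "down", "left", "right")

def step(s, a):
    """Deterministic successor of s under action a, or None if a is invalid."""
    row, col = divmod(s - 1, COLS)
    if a == "up" and row > 0:
        return s - COLS
    if a == "down" and row < 1:
        return s + COLS
    if a == "left" and col > 0:
        return s - 1
    if a == "right" and col < COLS - 1:
        return s + 1
    return None

def reward(s_next):
    """+10 for reaching the goal square (game ends), -1 for anything else."""
    return 10 if s_next == TERMINAL else -1

def bellman_update(Q):
    """One synchronous sweep of Q_{k+1}(s,a) = R(s,a,s') + max_{a'} Q_k(s',a')."""
    Q_next = {}
    for s in STATES:
        if s == TERMINAL:
            continue  # no actions are taken from the terminal square
        for a in ACTIONS:
            s2 = step(s, a)
            if s2 is None:
                continue  # invalid action: Q-value stays blank
            if s2 == TERMINAL:
                future = 0.0  # the game ends, so no future value accrues
            else:
                future = max(Q.get((s2, a2), 0.0)
                             for a2 in ACTIONS if step(s2, a2) is not None)
            Q_next[(s, a)] = reward(s2) + future
    return Q_next

# Hypothetical all-zero starting estimates (the problem's actual table differs).
Q0 = {}
Q1 = bellman_update(Q0)
print(Q1[(7, "right")])  # 10: moving right from 7 reaches the goal
```

Note that because T is deterministic, the sum over s' in Adam's rule collapses to the single reachable successor, which is why no explicit transition probabilities appear in the sketch.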