Question: 1 Consider a 5 0 0 x 5 0 0 grid world where the agent starts each episode in the bottom left corner and the

1 Consider a 500x500 grid world where the agent starts each episode in the bottom left corner and the goal is to reach the top-right corner in the least number of steps. To learn an optimal policy in this setup you decided on a reward function wherein the agent receives a reward of +1 on reaching the goal and 0 otherwise. Suppose you try to learn the optimal policy in two ways:
A: You use discounted returns with in (0,1) &
B: You use total return with no discounting.
Which among the following you expect to observe and why?
a) the same policy is learnt in A and B
b) no learning in A
c) no learning in B
d) policy learnt in B is better than the policy learned in A

Step by Step Solution

There are 3 Steps involved in it

1 Expert Approved Answer
Step: 1 Unlock blur-text-image
Question Has Been Solved by an Expert!

Get step-by-step solutions from verified subject matter experts

Step: 2 Unlock
Step: 3 Unlock

Students Have Also Explored These Related Programming Questions!