Question: 1
Consider a 500 x 500 grid world where the agent starts each episode in the bottom-left corner and the goal is to reach the top-right corner in the fewest steps. To learn an optimal policy in this setup, you decide on a reward function wherein the agent receives a reward of on reaching the goal and otherwise. Suppose you try to learn the optimal policy in two ways:
A: You use discounted returns with in &
B: You use total return with no discounting.
Which of the following do you expect to observe, and why?
a. The same policy is learned in A and B
b. No learning in A
c. No learning in B
d. The policy learned in B is better than the policy learned in A
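
To compare setups A and B concretely, it can help to compute the return of a single episode under each definition. The following is a minimal sketch, not a solution from the original page. The reward values (+1 at the goal, 0 on every other step), the discount factor 0.99, the four-direction movement model, and the helper name `episode_return` are all illustrative assumptions, since the question text as given here does not state them.

```python
def episode_return(num_steps, gamma=1.0, goal_reward=1.0, step_reward=0.0):
    """Return of an episode that reaches the goal on its final step.

    gamma=1.0 corresponds to the undiscounted total return (setup B);
    any gamma strictly between 0 and 1 corresponds to setup A.
    All default values are assumptions made for illustration.
    """
    total = 0.0
    for t in range(num_steps):
        # Only the last transition of the episode enters the goal.
        reward = goal_reward if t == num_steps - 1 else step_reward
        total += (gamma ** t) * reward
    return total


# Assuming moves in four directions, the shortest path on a 500 x 500 grid
# from the bottom-left to the top-right corner takes 499 + 499 steps.
shortest = 499 + 499
detour = 1200  # hypothetical longer path that still reaches the goal

for label, gamma in [("A (discounted)", 0.99), ("B (undiscounted)", 1.0)]:
    print(label,
          "shortest:", episode_return(shortest, gamma),
          "detour:", episode_return(detour, gamma))
```

Running the loop prints the return of a shortest path and of a longer path under each setup, which shows whether a given return definition distinguishes between the two paths under the assumed rewards.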
