Question: After observing the agent for a while, Adam realized that his assumption of T being deterministic is wrong in one specific way: when the agent

After observing the agent for a while, Adam realized that his assumption of $T$ being deterministic is wrong in one specific way: when the agent tries to legally move down, it occasionally ends up moving left instead (except from grid 1, where moving left would go out of bounds). Adam still assumes that all other movements are deterministic.
Suppose we have run Adam's suggested updates until convergence, to get $Q_{\text{wrong}}(s,a)$ under the original assumption of the wrong (deterministic) $T$.
Suppose $Q_{\text{correct}}(s,a)$ denotes the Q values under the new correct $T$ (where the agent sometimes moves left instead of down).
Note that you don't explicitly know the exact probabilities associated with this new $T$ (i.e., you don't know how often the agent moves left instead of down), but you know that it qualitatively differs in the way described above.
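To make the comparison concrete, the two sets of Q values can be written side by side. This is only a sketch under stated assumptions: it assumes the converged values satisfy the standard Bellman optimality equation, that rewards have the form $R(s,a,s')$, and that there is a discount factor $\gamma$; none of these are given in the excerpt. The symbols $p$ (the unknown probability of slipping left), $s_{\downarrow}$ (the cell below $s$), and $s_{\leftarrow}$ (the cell to the left of $s$) are hypothetical notation introduced here for illustration.

```latex
% General Bellman optimality equation for a transition model T (assumed form)
Q(s,a) = \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma \max_{a'} Q(s',a') \right]

% Deterministic (wrong) model: moving down always reaches s_down
Q_{\text{wrong}}(s,\text{down}) =
    R(s,\text{down},s_{\downarrow}) + \gamma \max_{a'} Q_{\text{wrong}}(s_{\downarrow},a')

% Correct model: with unknown probability p in (0,1) the agent slips left instead,
% for states where moving down is legal and moving left stays in bounds
Q_{\text{correct}}(s,\text{down}) =
    (1-p) \left[ R(s,\text{down},s_{\downarrow}) + \gamma \max_{a'} Q_{\text{correct}}(s_{\downarrow},a') \right]
    + p \left[ R(s,\text{down},s_{\leftarrow}) + \gamma \max_{a'} Q_{\text{correct}}(s_{\leftarrow},a') \right]
```

Under these assumptions, the two models differ directly only in the "down" backups, but the difference can propagate to any $(s,a)$ whose best continuation passes through such a move. Whether a given pair is overestimated depends on how the value of the slipped-to cell compares with the value of the intended cell, which is why the exact value of $p$ is not needed for a qualitative answer.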
Question 3.1 (6 points)
For which $(s,a)$ pairs will $Q_{\text{wrong}}(s,a)$ be an overestimate of $Q_{\text{correct}}(s,a)$?
