Question: Alice is taking CS234 and has just learned about Q-values. She is trying to explore a large finite-horizon MDP with horizon $H$ and $\gamma = 1$. The transitions are deterministic and $Q^*_{H+1}(s, a) = 0$ for all $s, a$. To help her with her MDP you tell her the optimal policy $\pi^*(s, t)$, defined in every state $s$ and timestep $t$, that Alice should follow to maximize her reward. Denote by $Q^*_t(s, a)$ the Q-value of the optimal policy upon taking action $a$ in state $s$ at timestep $t$.
A) First Step Error
In the first timestep $t = 1$ Alice is in state $s_1$ and chooses action $a$, which is suboptimal. If she then follows the optimal policy from $t = 2$ until the end of the episode, what is the value of this policy compared to the optimal one? Express your result only using $Q^*_1(s_1, \cdot)$.

Step by Step Solution

There are 3 steps involved in it.

Step 1: By definition, $Q^*_1(s_1, a)$ is the return obtained by taking action $a$ in state $s_1$ at timestep $t = 1$ and following the optimal policy $\pi^*$ from $t = 2$ until the end of the episode. This is exactly the policy Alice executes, so the value of her policy is $Q^*_1(s_1, a)$.

Step 2: The value of the optimal policy from $s_1$ at $t = 1$ is $V^*_1(s_1) = \max_{a'} Q^*_1(s_1, a')$.

Step 3: Alice's policy therefore falls short of the optimal one by

$$\max_{a'} Q^*_1(s_1, a') - Q^*_1(s_1, a),$$

a strictly positive loss, since $a$ is suboptimal.
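To make the result concrete, here is a minimal Python sketch (not part of the original solution) on a small made-up deterministic finite-horizon MDP: it computes $Q^*_t$ by backward induction with $\gamma = 1$ and $Q^*_{H+1} = 0$, then checks that the rollout "take $a$ at $t = 1$, then act optimally" returns exactly $Q^*_1(s_1, a)$. The horizon, state/action counts, and rewards are all hypothetical, chosen only for illustration.

```python
# A minimal sketch, assuming a small hypothetical deterministic MDP
# (H, sizes, and rewards are made up, not from the problem statement).
import numpy as np

H = 4                       # horizon (hypothetical)
n_states, n_actions = 3, 2  # hypothetical sizes

rng = np.random.default_rng(0)
next_state = rng.integers(0, n_states, size=(n_states, n_actions))  # deterministic transitions
R = rng.normal(size=(n_states, n_actions))                          # rewards r(s, a)

# Backward induction with gamma = 1 and Q_{H+1}(s, a) = 0:
# Q[t, s, a] = r(s, a) + max_a' Q[t+1, s', a'] with deterministic s' = next_state[s, a]
Q = np.zeros((H + 2, n_states, n_actions))
for t in range(H, 0, -1):
    for s in range(n_states):
        for a in range(n_actions):
            Q[t, s, a] = R[s, a] + Q[t + 1, next_state[s, a]].max()

def rollout(s, first_action):
    """Return of: take first_action at t = 1, then follow the optimal policy."""
    total, a = 0.0, first_action
    for t in range(1, H + 1):
        total += R[s, a]
        s = next_state[s, a]
        a = int(np.argmax(Q[t + 1, s]))  # greedy w.r.t. Q*_{t+1} is optimal
    return total

s1 = 0
for a in range(n_actions):
    assert np.isclose(rollout(s1, a), Q[1, s1, a])  # policy value is Q*_1(s1, a)
print("loss per first action:", Q[1, s1].max() - Q[1, s1])
```

The printed vector is $\max_{a'} Q^*_1(s_1, a') - Q^*_1(s_1, a)$ for each first action $a$: zero for the optimal first action and positive for any suboptimal one, matching Step 3.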
