Question 6. (6 marks)
Consider the following MDP: the set of states is S = {s0, s1, s2, s3} and the set of actions available at each state is A = {l, r}. Each episode of the MDP starts in s1 and terminates in s0.
You do not know the transition probabilities or the reward function of the MDP, so you are using Sarsa
to find the optimal policy. Suppose the current Q-values are:
Q(s0, l) = 0, Q(s0, r) = 0
Q(s1, l) = 3.4, Q(s1, r) = -1.8
Q(s2, l) = -0.8, Q(s2, r) = -0.7
Q(s3, l) = -0.5, Q(s3, r) = 7.5
Suppose the next episode is as follows:
s1, l, -1, s1, r, -1, s2, l, -1, s1, l, 10, s0.
(a) (4 marks) Do all the Sarsa updates to the Q-values that would result from this episode, using α = 0.25 and γ = 0.9. Show your working.
(b) (1 mark) Based on the updated Q-values, give the final policy π determined by Q, i.e., give π(s1), π(s2), and π(s3). Show your working.
(c) (1 mark) Give an ε-greedy policy based on the Q-values obtained in (a).
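
For readers checking their working, the following is a minimal Python sketch of parts (a) and (b), assuming the standard tabular Sarsa rule Q(s,a) ← Q(s,a) + α[r + γ·Q(s',a') - Q(s,a)] and treating the Q-values of the terminal state s0 as 0 (consistent with its table entries above). The dictionary Q, the episode list, and all variable names are illustrative, not part of the original question.

    # Sketch of the Sarsa updates for part (a), using the standard tabular rule
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a)) and treating
    # Q(s0, .) as 0, since s0 is terminal. All names here are illustrative.

    alpha, gamma = 0.25, 0.9

    # Initial Q-values as given in the question.
    Q = {
        ("s0", "l"): 0.0,  ("s0", "r"): 0.0,
        ("s1", "l"): 3.4,  ("s1", "r"): -1.8,
        ("s2", "l"): -0.8, ("s2", "r"): -0.7,
        ("s3", "l"): -0.5, ("s3", "r"): 7.5,
    }

    # The episode s1, l, -1, s1, r, -1, s2, l, -1, s1, l, 10, s0 split into
    # (state, action, reward, next_state, next_action) tuples; the final
    # transition enters the terminal state s0, so it has no next action.
    episode = [
        ("s1", "l", -1, "s1", "r"),
        ("s1", "r", -1, "s2", "l"),
        ("s2", "l", -1, "s1", "l"),
        ("s1", "l", 10, "s0", None),
    ]

    for s, a, r, s_next, a_next in episode:
        # Sarsa bootstraps off the current table, so the third update uses the
        # value of Q(s1, l) already revised by the first update. The bootstrap
        # term is 0 on the transition into the terminal state.
        bootstrap = Q[(s_next, a_next)] if a_next is not None else 0.0
        Q[(s, a)] += alpha * (r + gamma * bootstrap - Q[(s, a)])
        print(f"Q({s},{a}) <- {Q[(s, a)]:.6g}")

    # Part (b): the greedy policy picks the argmax action in each
    # non-terminal state.
    for s in ("s1", "s2", "s3"):
        print(f"pi({s}) =", max(("l", "r"), key=lambda a: Q[(s, a)]))

For part (c), an ε-greedy policy with respect to these Q-values would, in each state, take the greedy action found above with probability 1 - ε + ε/2 and the other action with probability ε/2, for some chosen ε in (0, 1).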