Question 6. (6 marks)

Consider the following MDP: the set of states is S = {s0, s1, …} and the set of actions available at each state is …. Each episode of the MDP starts in … and terminates in ….
You do not know the transition probabilities or the reward function of the MDP, so you are using Sarsa to find the optimal policy. Suppose the current Q-values are:
Suppose the next episode is as follows:
(a) (… marks) Do all the Sarsa updates to the Q-values that would result from this episode, using α = … and γ = …. Show your working.
(b) (1 mark) Based on the updated Q-values, give the final policy determined by them, i.e. give π(…) and π(…). Show your working.
(c) (1 mark) Give an ε-greedy policy based on the Q-values obtained in (a).
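
For reference, the general Sarsa update asked for in part (a) and the greedy/ε-greedy policies asked for in parts (b) and (c) can be sketched as below. This is a minimal illustration only: the states s0/s1, actions a0/a1, and the values alpha = 0.5, gamma = 0.9, epsilon = 0.1 are assumed placeholders, not the numbers from the question itself.

import random

# Hypothetical Q-table: the question's actual initial Q-values are not reproduced here.
Q = {
    ("s0", "a0"): 0.0, ("s0", "a1"): 0.0,
    ("s1", "a0"): 0.0, ("s1", "a1"): 0.0,
}
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # assumed values for illustration only

def sarsa_update(s, a, r, s_next, a_next, terminal=False):
    # One Sarsa update: Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)].
    # For a transition into the terminal state, Q(s',a') is taken to be 0.
    target = r if terminal else r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def greedy_policy(s, actions=("a0", "a1")):
    # Part (b): the policy determined by Q picks argmax_a Q(s, a).
    return max(actions, key=lambda a: Q[(s, a)])

def epsilon_greedy_policy(s, actions=("a0", "a1")):
    # Part (c): with probability epsilon choose a random action, otherwise act greedily.
    if random.random() < epsilon:
        return random.choice(actions)
    return greedy_policy(s, actions)

# Example: apply one update for a single (s, a, r, s', a') step of an episode.
sarsa_update("s0", "a0", r=1.0, s_next="s1", a_next="a1")
print(Q[("s0", "a0")])  # 0.5 with the placeholder numbers above

Part (a) amounts to applying one such update for each step of the given episode, in order; parts (b) and (c) then read the policy off the resulting Q-values.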
