Question: Problem 2. Consider an MDP with two states S = {0, 1}, two actions A = {1, 2}, and the following reward function and transition probabilities:
The other transition probabilities can be deduced from those given.
The discount factor is
Exercise on model-free prediction:
(a) For the policy that chooses a fixed given action in each state, starting from the given initial state, generate one episode of triplets (S_0, A_0, R_1), (S_1, A_1, R_2), …, for the specified number of time steps.
(b) Based on the episode, use Monte Carlo policy evaluation to estimate the value function.
(c) Based on the episode, use one-step temporal-difference (TD(0)) policy evaluation to estimate the value function.
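The reward values, transition probabilities, discount factor, policy, initial state, and episode length are not visible in the question text above, so the Python sketch below uses illustrative placeholder values: GAMMA, P_TO_1, REWARD, the policy dictionary, the start state, and T are all assumptions, not the problem's actual data. It shows one way to carry out parts (a)-(c): simulate one episode of (S_t, A_t, R_{t+1}) triplets under a fixed policy, then form an every-visit Monte Carlo estimate and a TD(0) estimate of the value function from that single episode.

```python
import random

# Placeholder two-state, two-action MDP. The actual rewards, transition
# probabilities, and discount factor are not shown in the question; the
# numbers below are illustrative assumptions only.
STATES = [0, 1]
ACTIONS = [1, 2]
GAMMA = 0.9                                   # assumed discount factor

P_TO_1 = {(0, 1): 0.3, (0, 2): 0.8, (1, 1): 0.6, (1, 2): 0.9}  # assumed P(next=1 | s, a)
REWARD = {(0, 1): 1.0, (0, 2): 0.0, (1, 1): 0.0, (1, 2): 2.0}  # assumed r(s, a)

def step(s, a, rng):
    """Sample the next state and reward for taking action a in state s."""
    s_next = 1 if rng.random() < P_TO_1[(s, a)] else 0
    return s_next, REWARD[(s, a)]

# Part (a): generate one episode of triplets (S_t, A_t, R_{t+1}) under a fixed policy.
policy = {0: 1, 1: 2}                         # assumed deterministic policy

def generate_episode(start=0, T=10, seed=0):  # start state and T are assumed
    rng = random.Random(seed)
    s, episode = start, []
    for _ in range(T):
        a = policy[s]
        s_next, r = step(s, a, rng)
        episode.append((s, a, r))
        s = s_next
    return episode

# Part (b): every-visit Monte Carlo evaluation from the single (truncated) episode.
def mc_evaluate(episode):
    returns = {s: [] for s in STATES}
    G = 0.0
    for s, _, r in reversed(episode):         # accumulate discounted returns backward
        G = r + GAMMA * G
        returns[s].append(G)
    return {s: (sum(g) / len(g) if g else 0.0) for s, g in returns.items()}

# Part (c): one-step TD (TD(0)) evaluation with a fixed step size alpha.
def td0_evaluate(episode, alpha=0.1):
    V = {s: 0.0 for s in STATES}
    # The final transition is skipped because its successor state is not stored.
    for (s, _, r), (s_next, _, _) in zip(episode, episode[1:]):
        V[s] += alpha * (r + GAMMA * V[s_next] - V[s])
    return V

if __name__ == "__main__":
    ep = generate_episode()
    print("episode        :", ep)
    print("MC estimate    :", mc_evaluate(ep))
    print("TD(0) estimate :", td0_evaluate(ep))
```

Because only one truncated episode is used, the Monte Carlo returns are cut off at the episode horizon, so both estimates should be read as rough single-episode approximations of the true value function.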
Exercise on model-free control:
(a) Use the SARSA algorithm to estimate the optimal action-value function by running the algorithm given in Sutton and Barto's book (2nd edition, available online).
(b) Use the Q-learning algorithm to estimate the optimal action-value function by running the algorithm given in Sutton and Barto's book (2nd edition, available online).
You only need to simulate one episode. In both cases, you will need to choose an appropriate fixed step size, exploration probability, and number of time steps for the episode.
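For the control exercise, here is a minimal sketch of tabular SARSA and Q-learning run for a single long episode on the same placeholder MDP as above. The step size alpha = 0.1, exploration probability eps = 0.1, starting state 0, and episode length T = 1000 are assumed choices you would replace with your own; the update rules follow the standard tabular forms in Sutton and Barto (2nd edition).

```python
import random

# Same placeholder MDP as in the prediction sketch: rewards, transition
# probabilities, and discount factor are assumptions, not the problem's data.
STATES, ACTIONS, GAMMA = [0, 1], [1, 2], 0.9
P_TO_1 = {(0, 1): 0.3, (0, 2): 0.8, (1, 1): 0.6, (1, 2): 0.9}  # assumed
REWARD = {(0, 1): 1.0, (0, 2): 0.0, (1, 1): 0.0, (1, 2): 2.0}  # assumed

def step(s, a, rng):
    s_next = 1 if rng.random() < P_TO_1[(s, a)] else 0
    return s_next, REWARD[(s, a)]

def eps_greedy(Q, s, eps, rng):
    """Pick a random action with probability eps, otherwise a greedy one."""
    if rng.random() < eps:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def sarsa(T=1000, alpha=0.1, eps=0.1, seed=0):
    """On-policy TD control: update toward the action actually taken next."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    s = 0                                     # assumed start state
    a = eps_greedy(Q, s, eps, rng)
    for _ in range(T):
        s_next, r = step(s, a, rng)
        a_next = eps_greedy(Q, s_next, eps, rng)
        Q[(s, a)] += alpha * (r + GAMMA * Q[(s_next, a_next)] - Q[(s, a)])
        s, a = s_next, a_next
    return Q

def q_learning(T=1000, alpha=0.1, eps=0.1, seed=0):
    """Off-policy TD control: update toward the greedy action's value."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    s = 0                                     # assumed start state
    for _ in range(T):
        a = eps_greedy(Q, s, eps, rng)
        s_next, r = step(s, a, rng)
        best_next = max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + GAMMA * best_next - Q[(s, a)])
        s = s_next
    return Q

if __name__ == "__main__":
    print("SARSA Q     :", sarsa())
    print("Q-learning Q:", q_learning())
```

The two functions differ only in the bootstrap target: SARSA uses the value of the ε-greedy action it actually takes next, while Q-learning uses the maximum action value in the next state.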
