Question 1.
Consider the following Markov decision process, with the gridworld and transition function as illustrated below. The states are grid squares, identified by their row and column number, row first. The agent always starts in the marked start state. There are two terminal goal states, each with the reward indicated in the figure; the reward in nonterminal states is likewise as indicated. The reward for a state is received as the agent moves into the state. The transition function is such that the intended agent movement (North, South, West, or East) happens with the probability shown in the transition-function figure; with the remaining probability, split equally, the agent ends up in one of the two states perpendicular to the intended direction. If a collision with a wall happens, the agent stays in the same state.

(Figure: (a) Gridworld MDP. (b) Transition function.)
a) Draw the optimal policy for this grid.
b) Suppose the agent knows the transition probabilities. Give the first two rounds of value-iteration updates for each state, with a discount of γ. Assume V_0 is 0 everywhere and compute V_1 and V_2.
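Since the grid layout, rewards, noise level, and discount are given only in the figure, the sketch below shows just the general shape of one value-iteration round for such a gridworld; GAMMA, NOISE, the grid size, the reward table, and the terminal set are hypothetical placeholders, not the problem's actual numbers.

    # A minimal sketch of one synchronous value-iteration round, assuming the
    # usual gridworld setup; GAMMA, NOISE, rewards, and terminals are
    # hypothetical placeholders for the numbers given in the figure.
    GAMMA = 0.9          # discount (placeholder)
    NOISE = 0.1          # probability of each perpendicular slip (placeholder)
    INTENDED = 1 - 2 * NOISE

    MOVES = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}
    PERP = {"N": "WE", "S": "WE", "W": "NS", "E": "NS"}

    def step(state, d, rows, cols):
        """Move one square in direction d; wall collisions leave the state unchanged."""
        r, c = state
        dr, dc = MOVES[d]
        nr, nc = r + dr, c + dc
        return (nr, nc) if 0 <= nr < rows and 0 <= nc < cols else state

    def vi_round(V, reward, terminals, rows, cols):
        """One round: V_{k+1}(s) = max_a sum_{s'} T(s, a, s') * (R(s') + GAMMA * V_k(s'))."""
        V_next = {}
        for s in V:
            if s in terminals:
                V_next[s] = V[s]      # terminal values are fixed
                continue
            qs = []
            for a in MOVES:
                outcomes = [(INTENDED, step(s, a, rows, cols))]
                outcomes += [(NOISE, step(s, p, rows, cols)) for p in PERP[a]]
                qs.append(sum(pr * (reward[s2] + GAMMA * V[s2]) for pr, s2 in outcomes))
            V_next[s] = max(qs)
        return V_next

Note the backup uses R(s'), the reward of the state moved into, matching the convention stated in the problem.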
c) Suppose the agent does not know the transition probabilities. What must it have available in order to learn the optimal policy?
d) The agent starts with the policy that always chooses to go right, and executes the three trials given in the problem. What are the Monte Carlo direct utility estimates for the states in question, given these traces?
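The trial traces themselves are given only in the original problem, but the estimator part d) asks for is mechanical. Here is a minimal sketch of direct utility estimation, assuming each trial is recorded as (state, reward) pairs and that the reward-to-go from a state includes that state's own entry reward, per the setup above.

    from collections import defaultdict

    def direct_utility(trials, gamma=1.0):
        """Average the observed reward-to-go over every visit to each state.
        Each trial is a list of (state, reward) pairs, the reward being the
        one received on entering that state (as specified in the problem)."""
        totals, counts = defaultdict(float), defaultdict(int)
        for trial in trials:
            ret = 0.0
            for state, reward in reversed(trial):   # accumulate return-to-go backwards
                ret = reward + gamma * ret
                totals[state] += ret
                counts[state] += 1
        return {s: totals[s] / counts[s] for s in totals}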
e) Using the learning rate α and the initial values given in the problem, what updates does the TD-learning agent make after the first two trials above? First give the TD-learning update equation, and then provide the updates after the two trials.
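For reference, the standard TD(0) update the question asks for has the form V(s) ← V(s) + α (r + γ V(s') − V(s)). A minimal sketch, with α, γ, and the initial values of V standing in for the elided numbers:

    def td_update(V, s, r, s_next, alpha, gamma=1.0):
        """TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s_next) - V(s)),
        applied once per observed transition s --r--> s_next."""
        V[s] += alpha * (r + gamma * V[s_next] - V[s])

    # Applying it along a recorded trial (a list of (s, r, s_next) transitions):
    # for s, r, s_next in trial:
    #     td_update(V, s, r, s_next, alpha=ALPHA)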
Consider the MDP below, in which there are two states and two actions, right and left, and the deterministic rewards on each transition are as indicated by the numbers in the figure. Note that if action right is taken in one of the states, the transition is stochastic: it may lead to either state, with the reward for each outcome as indicated, and the two outcomes occur with the probabilities shown.
Consider two deterministic policies, written as (action in the first state, action in the second state):
π1 = (left, right)
π2 = (right, right)
a) Show a typical trajectory for policy π1 from the start state.
b) Show a typical trajectory for policy π2 from the start state.
c) Assuming the given discount, the value of the start state under policy π1 is:
d) Assuming the given discount, the action-value of left under policy π1 is:
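Since the exact rewards and transition probabilities appear only in the figure, the rollout sketch below uses hypothetical stand-ins (states "A" and "B", a 0.5/0.5 split on the stochastic right action, unit rewards) purely to show how a typical trajectory under each deterministic policy would be sampled:

    import random

    # Hypothetical stand-in for the two-state MDP: the state names, rewards,
    # and the 0.5/0.5 split on the one stochastic transition are placeholders
    # for the numbers shown in the figure.
    P = {
        ("A", "left"):  [(1.0, "A", 0.0)],
        ("A", "right"): [(0.5, "B", 1.0), (0.5, "A", 0.0)],
        ("B", "left"):  [(1.0, "A", 0.0)],
        ("B", "right"): [(1.0, "B", 1.0)],
    }

    def rollout(policy, start="A", steps=5):
        """Sample a trajectory [(s, a, r, s'), ...] under a deterministic policy."""
        traj, s = [], start
        for _ in range(steps):
            a = policy[s]
            roll, acc = random.random(), 0.0
            for prob, s_next, r in P[(s, a)]:
                acc += prob
                if roll <= acc:
                    break
            traj.append((s, a, r, s_next))
            s = s_next
        return traj

    pi1 = {"A": "left", "B": "right"}    # the first policy above
    pi2 = {"A": "right", "B": "right"}   # the second policy above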
The plot below shows the target value per time step as a grey line. Apply the update equation given in the problem to determine the estimates for each of the time steps shown. Draw your answers on the graph below, where the first value is provided in blue. Show all your calculations.
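The update equation itself is elided above; assuming it is the usual constant-step-size running-average update V_{t+1} = V_t + α (target_t − V_t), the estimates can be computed step by step as in this sketch, where the target sequence and α are placeholders for the plot's values:

    def running_estimates(targets, v0, alpha):
        """Assuming the elided update is V_{t+1} = V_t + alpha * (target_t - V_t):
        targets is the grey line from the plot and v0 the blue initial value."""
        vs = [v0]
        for tgt in targets:
            vs.append(vs[-1] + alpha * (tgt - vs[-1]))
        return vs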
