Need help with this problem, can anyone help please?

Consider the MDP shown below. It has 6 states and 4 actions. As shown in the figure, the transitions for all actions have Pr = 0.7 of succeeding (leading to the state the arrow points to) and Pr = 0.3 of failing (in which case the agent stays in place). For transitions that are not shown, assume the state stays the same (e.g. T(S1, Left, S1) = 1). The rewards depend on the state only and are shown in each node; they are the same for all actions (e.g. R(S4, a) = +10 for all a). Assume a discount factor of gamma = 0.9.

(a) Describe the space of all possible policies for this MDP. How many are there?
(b) Assuming an initial policy pi_0(s) = Right for all s, perform policy evaluation to get the initial value function for each state, V^0(s), for all s.
(c) Given the initial estimate V^0, if you run an iteration of policy improvement, what will be the new policy at each state? (If necessary, break ties alphabetically, e.g. "Down" before "Left", etc.)
(d) What is the optimal value function at each state for this domain? Is the optimal value function unique? Explain.
(e) What is the optimal policy at each state for this domain? Is the optimal policy unique? Explain.
(f) Suggest a change to the reward function that changes the value function but does not change the optimal policy.
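Since the figure with the transition arrows is not reproduced in this post, here is a minimal Python sketch of parts (b) and (c): iterative policy evaluation for pi_0(s) = Right followed by one greedy improvement step. The `arrows` and `rewards` dictionaries are placeholders you would have to fill in from your figure (they are not taken from the actual problem); everything else (0.7/0.3 succeed/stay dynamics, self-loops for unshown transitions, state-only rewards, gamma = 0.9) follows the statement above.

```python
gamma = 0.9
states = ["S1", "S2", "S3", "S4", "S5", "S6"]
actions = ["Down", "Left", "Right", "Up"]   # kept in alphabetical order for tie-breaking

# Placeholders -- fill these in from the figure (not reproduced in the post):
rewards = {s: 0.0 for s in states}   # e.g. rewards["S4"] = 10.0 per the figure
arrows = {}                          # e.g. arrows[("S1", "Right")] = "S2" if the figure shows that arrow

def transition(s, a):
    """Return {next_state: prob}. A shown arrow succeeds with Pr = 0.7 and the agent
    stays put with Pr = 0.3; any transition not shown keeps the state the same."""
    s_next = arrows.get((s, a))
    if s_next is None or s_next == s:
        return {s: 1.0}
    return {s_next: 0.7, s: 0.3}

def evaluate_policy(policy, tol=1e-10):
    """Iterative policy evaluation: V(s) = R(s) + gamma * sum_s' T(s, pi(s), s') V(s')."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = rewards[s] + gamma * sum(
                p * V[s2] for s2, p in transition(s, policy[s]).items())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

def improve_policy(V):
    """One greedy policy-improvement step; because `actions` is already alphabetical,
    ties are broken alphabetically (e.g. "Down" before "Left")."""
    new_policy = {}
    for s in states:
        q = {a: rewards[s] + gamma * sum(
                 p * V[s2] for s2, p in transition(s, a).items())
             for a in actions}
        best = max(q.values())
        new_policy[s] = next(a for a in actions if q[a] == best)
    return new_policy

pi_0 = {s: "Right" for s in states}   # initial policy pi_0(s) = Right for all s
V_0 = evaluate_policy(pi_0)
pi_1 = improve_policy(V_0)
print("V^0:", V_0)
print("pi_1:", pi_1)
```

Once the arrows and rewards from the figure are filled in, the script prints V^0 and the improved policy pi_1. Alternating evaluate_policy and improve_policy until the policy stops changing is the policy-iteration loop whose fixed point addresses the optimal-value and optimal-policy parts (d) and (e).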