Question: Need help with this problem, can anyone help please?

Consider the MDP shown below. It has 6 states and 4 actions. As shown in the figure, the transitions for all actions have probability 0.7 of succeeding (leading to the state indicated by the arrow) and probability 0.3 of failing (in which case the agent stays in place). For transitions that are not shown, assume the state stays the same (e.g. T(S1, Left, S1) = 1). Rewards depend on the state only and are shown in each node; they are the same for all actions (e.g. R(S4, a) = +10 for all a). Assume a discount factor of gamma = 0.9.

a) Describe the space of all possible policies for this MDP. How many are there?
b) Assuming an initial policy pi_0(s) = Right for all s, perform policy evaluation to get the initial value function V^0(s) for each state s.
c) Given the initial estimate V^0, if you run one iteration of policy improvement, what will be the new policy at each state? If necessary, break ties alphabetically (e.g. "Down" before "Left").
d) What is the optimal value function at each state for this domain? Is the optimal value function unique? Explain.
e) What is the optimal policy at each state for this domain? Is the optimal policy unique? Explain.
f) Suggest a change to the reward function that changes the value function but does not change the optimal policy.
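For reference, the quantities the question asks about can be written out in general form. If policies are restricted to deterministic stationary ones, each of the 6 states can be assigned any of the 4 actions, so there are 4^6 = 4096 such policies. Policy evaluation for the fixed policy pi_0 and one step of policy improvement use the standard definitions below (these are generic; the specific numbers depend on the figure, which is not reproduced here):

```latex
V^{\pi_0}(s) = R(s) + \gamma \sum_{s'} T\big(s, \pi_0(s), s'\big)\, V^{\pi_0}(s')
\qquad
\pi_1(s) = \arg\max_{a} \Big[ R(s) + \gamma \sum_{s'} T(s, a, s')\, V^{\pi_0}(s') \Big]
```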

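A minimal sketch of how parts (b) and (c) could be computed numerically, assuming the MDP is encoded as a transition tensor T and reward vector R. Because the figure is not reproduced here, T and R below are empty placeholders that must be filled in from the figure before the results mean anything; the action ordering (Down, Left, Right, Up) is an assumption chosen to match the alphabetical tie-breaking rule.

```python
import numpy as np

# Placeholder MDP: 6 states, 4 actions ordered alphabetically
# as Down=0, Left=1, Right=2, Up=3 (assumption, matches the tie-break rule).
S, A = 6, 4
gamma = 0.9

T = np.zeros((S, A, S))   # T[s, a, s'] = Pr(s' | s, a)  -- fill in from the figure
R = np.zeros(S)           # R[s]: reward depends on state only -- fill in from the figure
# Encoding one arrow from the figure (hypothetical indices s, a, s_next):
#   T[s, a, s_next] = 0.7   # action succeeds
#   T[s, a, s]      = 0.3   # action fails, agent stays in place
# Actions with no arrow leave the state unchanged:
#   T[s, a, s] = 1.0

def evaluate_policy(pi, T, R, gamma):
    """Exact policy evaluation: solve V = R + gamma * T_pi @ V as a linear system."""
    n = len(R)
    T_pi = T[np.arange(n), pi]                      # n x n transition matrix under pi
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R)

def improve_policy(V, T, R, gamma):
    """One step of greedy policy improvement with respect to V.
    np.argmax breaks ties toward the lowest index, i.e. alphabetical order here."""
    Q = R[:, None] + gamma * np.einsum('sap,p->sa', T, V)
    return Q.argmax(axis=1)

pi0 = np.full(S, 2)                 # index 2 = "Right" everywhere, i.e. pi_0(s) = Right
V0 = evaluate_policy(pi0, T, R, gamma)
pi1 = improve_policy(V0, T, R, gamma)
```

With the real T and R filled in, V0 answers part (b) and pi1 answers part (c); repeating the two steps until the policy stops changing is full policy iteration, which yields the optimal value function and policy asked about in parts (d) and (e).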