Question:

Q4. Model-free Reinforcement Learning: Cycle (20 points)

Consider an MDP with 3 states, A, B and C, and 2 actions, Clockwise and Counterclockwise. We do not know the transition function or the reward function for the MDP; instead, we are given samples of what an agent actually experiences when it interacts with the environment (although we do know that we do not remain in the same state after taking an action). In this problem, instead of first estimating the transition and reward functions, we will directly estimate the Q function using Q-learning.

Assume the discount factor γ is 0.5 and the step size for Q-learning α is 0.5.

Our current Q function, Q(s, a), is shown in the left figure:

                      A         B         C
  Clockwise         1.501    -0.451     2.73
  Counterclockwise  3.153    -6.055     2.133

The agent encounters the samples shown in the right figure:

  s          a                 s'         r
  A          Counterclockwise  (missing)  8.0
  (missing)  Counterclockwise  A          0.0

Provide the Q-values for all pairs of (state, action) after both samples have been accounted for.
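Each sample (s, a, s', r) changes exactly one entry of the table through the standard Q-learning update, Q(s, a) <- (1 - α) Q(s, a) + α (r + γ max over a' of Q(s', a')); every other entry stays as it was. The following is a minimal Python sketch of that update, not a confirmed solution. It makes two assumptions that the text of the question does not settle: the Q-table columns are read in the order A, B, C, and the state missing from each sample is taken to be C (the placeholder name `ASSUMED_MISSING_STATE` is hypothetical). Change that constant if the original figure shows a different state.

```python
# Tabular Q-learning update for the 3-state cycle MDP in the question.
GAMMA = 0.5  # discount factor, from the question
ALPHA = 0.5  # step size, from the question
ACTIONS = ("Clockwise", "Counterclockwise")


def q_update(q, s, a, s_next, r, alpha=ALPHA, gamma=GAMMA):
    """Apply one Q-learning update in place and return the new Q(s, a)."""
    # Sampled target: reward plus discounted value of the best next action.
    target = r + gamma * max(q[(s_next, b)] for b in ACTIONS)
    # Move the old estimate toward the target by the step size.
    q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * target
    return q[(s, a)]


# Current Q function. ASSUMPTION: columns are in the order A, B, C.
q = {
    ("A", "Clockwise"): 1.501,
    ("B", "Clockwise"): -0.451,
    ("C", "Clockwise"): 2.73,
    ("A", "Counterclockwise"): 3.153,
    ("B", "Counterclockwise"): -6.055,
    ("C", "Counterclockwise"): 2.133,
}
q_before = dict(q)

# ASSUMPTION: the state missing from both samples is C (not in the source).
ASSUMED_MISSING_STATE = "C"
samples = [
    ("A", "Counterclockwise", ASSUMED_MISSING_STATE, 8.0),
    (ASSUMED_MISSING_STATE, "Counterclockwise", "A", 0.0),
]

# Samples are processed in order, so the second update sees the first one.
for s, a, s_next, r in samples:
    q_update(q, s, a, s_next, r)

for (state, action), value in sorted(q.items()):
    print(f"Q({state}, {action}) = {value:.3f}")
```

Under those assumptions, the first sample moves Q(A, Counterclockwise) to about 6.259 and the second moves Q(C, Counterclockwise) to about 2.631, while the other four entries keep their original values. The order of the samples matters here, because the second update uses the already revised Q(A, Counterclockwise) in its max.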
