Question: Q4. Model-free Reinforcement Learning: Cycle (20 points)

Consider an MDP with 3 states, A, B and C, and 2 actions, Clockwise and Counterclockwise. We do not know the transition function or the reward function for the MDP. Instead, we are given samples of what an agent actually experiences when it interacts with the environment (although we do know that we do not remain in the same state after taking an action). In this problem, instead of first estimating the transition and reward functions, we will directly estimate the Q function using Q-learning.

Assume the discount factor, γ, is 0.5 and the step size for Q-learning, α, is 0.5.

Our current Q function, Q(s,a), is:

                    A        B        C
Clockwise           1.501    -0.451   2.73
Counterclockwise    3.153    -6.055   2.133

The agent encounters the following samples:

s            a                   s'           r
A            Counterclockwise    [missing]    8.0
[missing]    Counterclockwise    A            0.0

Provide the Q-values for all pairs of (state, action) after both samples have been accounted for.
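The update the question asks for can be sketched in Python as follows. This is a minimal illustration of the standard tabular Q-learning rule, Q(s,a) <- (1 - α) Q(s,a) + α (r + γ max over a' of Q(s',a')), using the Q-values, γ = 0.5 and α = 0.5 from the problem. The names initial_q and q_update are hypothetical helpers introduced here. The next state of the first sample is not legible in the problem as posted, so the sketch evaluates both candidates allowed by the no-self-transition rule (B and C) instead of assuming one.

```python
# Minimal sketch of one tabular Q-learning update (helper names are hypothetical).
GAMMA = 0.5   # discount factor from the problem statement
ALPHA = 0.5   # Q-learning step size from the problem statement
ACTIONS = ("Clockwise", "Counterclockwise")


def initial_q():
    """Return the current Q(s, a) table given in the problem."""
    return {
        ("A", "Clockwise"): 1.501,
        ("B", "Clockwise"): -0.451,
        ("C", "Clockwise"): 2.73,
        ("A", "Counterclockwise"): 3.153,
        ("B", "Counterclockwise"): -6.055,
        ("C", "Counterclockwise"): 2.133,
    }


def q_update(q, s, a, s_next, r, alpha=ALPHA, gamma=GAMMA):
    """Apply one Q-learning update for the sample (s, a, s_next, r).

    Returns a new table; only the entry for (s, a) changes.
    """
    best_next = max(q[(s_next, b)] for b in ACTIONS)  # max over a' of Q(s', a')
    target = r + gamma * best_next                    # one-sample estimate of Q(s, a)
    new_q = dict(q)
    new_q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * target
    return new_q


if __name__ == "__main__":
    # First sample: s = A, a = Counterclockwise, r = 8.0.
    # ASSUMPTION: s' is tried as both B and C because it is missing in the source.
    for s_next in ("B", "C"):
        q = q_update(initial_q(), "A", "Counterclockwise", s_next, 8.0)
        print(s_next, q[("A", "Counterclockwise")])
```

Under these assumptions, Q(A, Counterclockwise) becomes about 6.259 if the first sample lands in C and about 5.464 if it lands in B. The second sample is handled the same way, by calling q_update on the table returned from the first update, so the order of the samples matters.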
