Question: Q4. Model-free Reinforcement Learning: Cycle (20 points)

Consider an MDP with 3 states, A, B and C, and 2 actions, Clockwise and Counterclockwise. We do not know the transition function or the reward function for the MDP; instead, we are given samples of what an agent actually experiences when it interacts with the environment (although we do know that we do not remain in the same state after taking an action). In this problem, instead of first estimating the transition and reward functions, we will directly estimate the Q function using Q-learning.

Assume the discount factor, γ, is 0.5 and the step size for Q-learning, α, is 0.5.

Our current Q function, Q(s,a), is shown in the left figure:

                    A        B        C
Clockwise           1.501    -0.451   2.73
Counterclockwise    3.153    -6.055   2.133

The agent encounters the samples shown in the right figure:

s     a                   s'    r
A     Counterclockwise    [?]   8.0
[?]   Counterclockwise    A     0.0

Each [?] marks a state label that is missing from the extracted figure. It is presumably C, since the column label C is also missing from the Q table and we never remain in the same state.

Provide the Q-values for all pairs of (state, action) after both samples have been accounted for.
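The Q-learning update the question asks for is Q(s,a) ← (1 − α)·Q(s,a) + α·[r + γ·max over a' of Q(s',a')]. The sketch below applies it to the two samples in order. It rests on one assumption that the extracted text does not confirm: that the state label missing from each sample is C, so the samples are (A, Counterclockwise, C, 8.0) and (C, Counterclockwise, A, 0.0). The names `q`, `samples`, and `q_update` are illustrative only.

```python
GAMMA = 0.5  # discount factor from the question
ALPHA = 0.5  # Q-learning step size from the question
ACTIONS = ("Clockwise", "Counterclockwise")

# Current Q function, keyed by (state, action), as read from the left figure.
q = {
    ("A", "Clockwise"): 1.501,
    ("B", "Clockwise"): -0.451,
    ("C", "Clockwise"): 2.73,
    ("A", "Counterclockwise"): 3.153,
    ("B", "Counterclockwise"): -6.055,
    ("C", "Counterclockwise"): 2.133,
}

# Samples as (s, a, s', r). ASSUMPTION: the state missing from each
# sample in the extracted figure is C.
samples = [
    ("A", "Counterclockwise", "C", 8.0),
    ("C", "Counterclockwise", "A", 0.0),
]


def q_update(q, s, a, s_next, r):
    """Apply one Q-learning update in place for the sample (s, a, s_next, r)."""
    # Bootstrapped target: reward plus discounted best Q-value in the next state.
    target = r + GAMMA * max(q[(s_next, b)] for b in ACTIONS)
    # Blend the old estimate with the target using the step size.
    q[(s, a)] = (1 - ALPHA) * q[(s, a)] + ALPHA * target


# Process the samples in the order given; the second update sees the
# result of the first.
for s, a, s_next, r in samples:
    q_update(q, s, a, s_next, r)

for (state, action), value in sorted(q.items()):
    print(f"Q({state}, {action}) = {value:.5f}")
```

Under that assumption, the first sample gives Q(A, Counterclockwise) = 0.5·3.153 + 0.5·(8.0 + 0.5·2.73) = 6.259. The second sample then uses this updated value and gives Q(C, Counterclockwise) = 0.5·2.133 + 0.5·(0.0 + 0.5·6.259) = 2.63125. The other four Q-values are unchanged, because Q-learning only updates the (state, action) pair that was actually sampled.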
