Question: Q4. Model-free Reinforcement Learning: Cycle (20 points)

Consider an MDP with 3 states, A, B and C, and 2 actions, Clockwise and Counterclockwise. We do not know the transition function or the reward function for the MDP; instead, we are given samples of what an agent actually experiences when it interacts with the environment (although we do know that the agent never remains in the same state after taking an action). In this problem, instead of first estimating the transition and reward functions, we will directly estimate the Q function using Q-learning.

Assume the discount factor, γ, is 0.5 and the step size for Q-learning, α, is 0.5.

Our current Q function, Q(s, a), is:

                    A        B        C
Clockwise           1.501    -0.451   2.73
Counterclockwise    3.153    -6.055   2.133

The agent encounters the following samples:

s    a                   s'   r
A    Counterclockwise    …    8.0
…    Counterclockwise    A    0.0

Provide the Q-values for all (state, action) pairs after both samples have been accounted for.
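The Q-learning update the question asks for is Q(s, a) ← (1 − α) Q(s, a) + α [r + γ max over a' of Q(s', a')]. A minimal Python sketch of that update follows, using the Q-table, γ and α from the question. The names `initial_q`, `q_learning_update` and `candidates` are hypothetical. The next state of the first sample is not legible in the question text, so the sketch does not assume one; it evaluates the update for both states the agent could have reached from A (B or C, since it never stays in place). It illustrates the mechanics only and is not the expert's answer.

```python
GAMMA = 0.5  # discount factor from the question
ALPHA = 0.5  # Q-learning step size from the question
ACTIONS = ("Clockwise", "Counterclockwise")


def initial_q():
    """Return a fresh copy of the current Q-table given in the question."""
    return {
        ("A", "Clockwise"): 1.501,
        ("B", "Clockwise"): -0.451,
        ("C", "Clockwise"): 2.73,
        ("A", "Counterclockwise"): 3.153,
        ("B", "Counterclockwise"): -6.055,
        ("C", "Counterclockwise"): 2.133,
    }


def q_learning_update(q, s, a, r, s_next, alpha=ALPHA, gamma=GAMMA):
    """Apply one Q-learning update in place and return the new Q(s, a)."""
    # Value of the best action available in the next state.
    best_next = max(q[(s_next, b)] for b in ACTIONS)
    # One-sample estimate of the return: reward plus discounted lookahead.
    target = r + gamma * best_next
    # Blend the old estimate with the new target using step size alpha.
    q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * target
    return q[(s, a)]


# First sample: s = A, a = Counterclockwise, r = 8.0, s' unknown.
# ASSUMPTION: s' is tried as both B and C because the source does not show it.
candidates = {}
for s_next in ("B", "C"):
    q = initial_q()
    candidates[s_next] = q_learning_update(q, "A", "Counterclockwise", 8.0, s_next)
    print(f"s' = {s_next}: Q(A, Counterclockwise) = {candidates[s_next]:.5f}")
```

Only the entry for the sampled (state, action) pair changes; every other Q-value is carried over unchanged. The second sample would be processed the same way, starting from the table left by the first update, once its starting state is known.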
