Problem 3 (Value Iteration Using the Action Value Function) (40 pts): Following the notation given in the lecture notes, or alternatively Chapter 4 of the book by Sutton and Barto, answer the following questions.

(a) Combining the results from Problem 1 parts (a) and (b), give the Bellman optimality equation for q_*. (Hint: Use different symbols a and a' to distinguish the action taken in the current state s from the action taken in the next state s'.)

For reference, Problem 1 parts (a) and (b) were:
(a) Give an equation for q_* in terms of the transition probability p(s', r | s, a) and the optimal value function v_*. (Hint: Recall that we derived the equation for q_pi in terms of the transition probability p and v_pi. What changes if we follow the optimal policy pi_*, instead of an arbitrary policy pi, starting from the next state s'?)
(b) Give an equation for v_* in terms of q_*. (Hint: Use the result from part (a) and the Bellman optimality equation.)

(b) The main step of value iteration turns the Bellman optimality equation into an iterative update. Given an (approximate) action value function q_k(s, a), what is the equation for obtaining the next iterate, q_{k+1}(s, a), using the Bellman optimality equation for q_*?

(c) Using the result from part (b), write the value iteration pseudocode (refer to the lecture notes or Chapter 4.4 of the book) in terms of the action value function Q(s, a) instead of the state value function V(s). (Hint: The main update can be rewritten using part (b). Use part (d) of Problem 1 to output a deterministic policy in the last step.)
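For reference, one common way the equations in parts (a) and (b) are written, following the notation of Sutton and Barto for a finite MDP, is sketched below in LaTeX. This is an illustration of the notation the question asks about, not the graded solution.

% Bellman optimality equation for q_* (part (a)): a is the action taken in the
% current state s, while the next-state action a' is chosen greedily.
q_*(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \max_{a'} q_*(s', a') \right]

% Problem 1 part (b): v_* expressed in terms of q_*.
v_*(s) = \max_{a} q_*(s, a)

% Part (b): the same equation turned into an iterative update on q_k.
q_{k+1}(s, a) = \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \max_{a'} q_k(s', a') \right]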
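For part (c), a minimal Python sketch of value iteration on Q(s, a) is given below. The data layout is an assumption made only for illustration: the dynamics are taken to be a nested dictionary P[s][a] holding (prob, next_state, reward) tuples for p(s', r | s, a), and the names q_value_iteration, n_states, n_actions, gamma, and theta are all hypothetical, not taken from the lecture notes or the book.

import numpy as np

def q_value_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-8):
    """Value iteration using the action value function Q(s, a).

    P is assumed to satisfy: P[s][a] is a list of (prob, next_state, reward)
    tuples describing p(s', r | s, a). This layout is an illustrative choice.
    """
    Q = np.zeros((n_states, n_actions))
    while True:
        delta = 0.0
        for s in range(n_states):
            for a in range(n_actions):
                # Bellman optimality backup for q_*: expected reward plus the
                # discounted max over next-state actions a' of Q(s', a').
                q_new = sum(prob * (reward + gamma * np.max(Q[s_next]))
                            for prob, s_next, reward in P[s][a])
                delta = max(delta, abs(q_new - Q[s, a]))
                Q[s, a] = q_new
        if delta < theta:          # stop once the largest update is small
            break
    # Last step: output a deterministic greedy policy, pi(s) = argmax_a Q(s, a),
    # in the spirit of Problem 1 part (d).
    policy = np.argmax(Q, axis=1)
    return Q, policy

The structure mirrors the V(s) version of value iteration in Chapter 4.4: the only changes are that the table is indexed by (s, a) and the max is taken inside the backup over the next-state action a' rather than over the current action.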
