Question: Problem 1 (Optimal Value Functions and Policies) (20 pts)

In this problem, we will practice/review the relations between optimal value functions and how to derive optimal policies from optimal value functions. Following the notation given in the lecture notes, or alternatively in Chapter 3 of the book by Sutton and Barto, answer the following questions.

(a) Give an equation for $q_*$ in terms of the transition probability $p(s', r \mid s, a)$ and the optimal value function $v_*$. (Hint: Recall that we have derived the equation for $q_\pi$ in terms of the transition probability $p$ and $v_\pi$. What if we now follow the optimal policy $\pi_*$ instead of just any policy $\pi$, starting from the next state $s'$?)

(b) Give an equation for $v_*$ in terms of $q_*$. (Hint: Use the result in part (a) and the Bellman optimality equation.)

(c) Give an equation for the optimal policy $\pi_*(s)$ in terms of the transition probability $p(s', r \mid s, a)$ and $v_*$. For simplicity, consider only the deterministic optimal policy here (that is, $\pi_*(s)$ is one action in each state $s$). (Hint: Start from the Bellman optimality equation for $v_*$.)

(d) Give an equation for the optimal policy $\pi_*(s)$ in terms of $q_*$. Again, consider the deterministic policy case. (Hint: Combine the results in parts (a) and (c), or use the result from part (b) directly.)
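For reference, the four relations the question asks for can be sketched as follows. This is a minimal outline in the standard Sutton and Barto Chapter 3 notation, not the hidden expert answer. It assumes a discount factor $\gamma$ (not named in the question text) and writes $\mathcal{A}(s)$ for the set of actions available in state $s$. Where several actions attain the maximum, any one of them gives a deterministic optimal policy.

```latex
% (a) q_* in terms of p and v_*: take action a, then act optimally from s'
q_*(s, a) = \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma \, v_*(s') \bigr]

% (b) v_* in terms of q_*: the optimal value is the value of the best action
v_*(s) = \max_{a \in \mathcal{A}(s)} q_*(s, a)

% (c) deterministic optimal policy in terms of p and v_*:
%     pick the action that attains the max in the Bellman optimality equation
\pi_*(s) = \operatorname*{arg\,max}_{a \in \mathcal{A}(s)}
           \sum_{s', r} p(s', r \mid s, a) \bigl[ r + \gamma \, v_*(s') \bigr]

% (d) deterministic optimal policy in terms of q_*: substitute (a) into (c)
\pi_*(s) = \operatorname*{arg\,max}_{a \in \mathcal{A}(s)} q_*(s, a)
```

Substituting (a) into (b) recovers the Bellman optimality equation for $v_*$, which is a quick consistency check on the four expressions.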

