
Question 1: Trajectories, returns, and values (15 points total). This question has six subparts.

[Figure: a two-state MDP with states X and Y; arrows for the actions right and left are labeled with their rewards (+1, -1, and 0).]

Consider the MDP above, in which there are two states, X and Y, two actions, right and left, and the deterministic rewards on each transition are as indicated by the numbers. Note that if action right is taken in state X, then the transition may be either to X with a reward of +1 or to Y with a reward of -1. These two possibilities occur with probabilities 2/3 (for the transition to X) and 1/3 (for the transition to Y).

Consider two deterministic policies, π1 and π2:

    π1(X) = right    π1(Y) = left
    π2(X) = right    π2(Y) = right

(a) (2 pts.) Show a typical trajectory (sequence of states, actions, and rewards) from X for policy π1.

(b) (2 pts.) Show a typical trajectory (sequence of states, actions, and rewards) from X for policy π2.

(c) (2 pts.) Assuming the discount-rate parameter is γ = 0.5, what is the return from the initial state for the second trajectory?  G_0 =

(d) (2 pts.) Assuming γ = 0.5, what is the value of state Y under policy π1?  v_π1(Y) =

(e) (2 pts.) Assuming γ = 0.5, what is the action-value of (X, left) under policy π1?  q_π1(X, left) =

(f) (5 pts.) Assuming γ = 0.5, what is the value of state X under policy π2?  v_π2(X) =
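For reference, the quantities asked for in parts (c)-(f) use the standard definitions. With discount rate γ, the return from time t is

    G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... = Σ_{k≥0} γ^k R_{t+k+1},

the state value is v_π(s) = E_π[G_t | S_t = s], and the action value is q_π(s, a) = E_π[G_t | S_t = s, A_t = a]. For part (f), since π2 selects right in both states, the Bellman equation at X uses exactly the dynamics stated in the question:

    v_π2(X) = (2/3)[+1 + γ v_π2(X)] + (1/3)[-1 + γ v_π2(Y)],

where v_π2(Y) is determined by the dynamics of right from Y, which are specified only in the figure.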
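The Bellman equations for a fixed policy form a small linear system, so part (f) can be checked numerically. Below is a minimal policy-evaluation sketch in Python. Only the right-from-X row is given in the question text; the right-from-Y row and its reward are hypothetical placeholders standing in for the figure, so the printed numbers are illustrative only.

```python
import numpy as np

# Minimal policy-evaluation sketch for the two-state MDP (X = 0, Y = 1)
# under policy pi2, which takes "right" in both states.
#
# Given in the text: X --right--> X with reward +1 (prob 2/3),
#                    X --right--> Y with reward -1 (prob 1/3).
# The Y row below is a HYPOTHETICAL placeholder for the figure's dynamics.

gamma = 0.5

# P[s, s'] = transition probability under pi2; r[s] = expected one-step reward.
P = np.array([
    [2/3, 1/3],   # from X under right (given in the question)
    [0.0, 1.0],   # from Y under right -- placeholder, replace with the figure's values
])
r = np.array([
    (2/3) * (+1) + (1/3) * (-1),  # expected reward from X under right (given)
    0.0,                          # expected reward from Y under right -- placeholder
])

# Bellman equation v = r + gamma * P v  =>  solve (I - gamma * P) v = r.
v = np.linalg.solve(np.eye(2) - gamma * P, r)
print("v_pi2(X) =", v[0], " v_pi2(Y) =", v[1])
```

Swapping in the rewards and transitions shown in the figure for the Y row (and for left, if evaluating π1) gives the exact values asked for in parts (d)-(f).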
