This question calls for a straightforward application of definitions introduced in the Week 6 lecture. Consider...
Question:
This question calls for a straightforward application of definitions introduced in the Week 6 lecture. Consider the MDP shown in the figure below. It has two states, s₁ and s₂, and three actions: a, b, and c. Action a is deterministic, always leading to state s₁; action b is also deterministic, but always leads to state s₂. Action c, on the other hand, keeps the agent in the starting state with probability 1/2 and moves the agent to the other state with probability 1/2. Action a merits a reward of 1 and action b a reward of 2, regardless of the state from which they are taken. Action c yields a reward of 3 if taken from s₁ and a reward of 2 if taken from s₂. Observe that all the rewards can be written in terms of the starting state and action alone, with no dependence on the next state. The MDP has a discount factor γ = 3/4.

[Figure: transition diagram over s₁ and s₂ for actions a, b, and c, with γ = 3/4. Arrows are marked with "probability, reward"; transitions with zero probability are not shown.]

Consider the policy π = "ac", which takes action a from s₁ and action c from s₂. What are the improving actions for s₁ and s₂ under this policy? In other words, what are IA(ac, s₁) and IA(ac, s₂)? Show the working to arrive at your answer.
Expert Answer:
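A minimal sketch of the working, not taken from the source page. It assumes the usual definition of improving actions, IA(π, s) = {actions x with Q^π(s, x) > V^π(s)}; the Week 6 lecture is not reproduced here, so that definition is an assumption. The names T, R, evaluate, q_value, and improving_actions are hypothetical helpers chosen for illustration. The sketch first solves the two Bellman equations for the policy "ac" exactly, then compares each action's Q-value with the state's value.

```python
from fractions import Fraction as F

GAMMA = F(3, 4)
STATES = ("s1", "s2")
ACTIONS = ("a", "b", "c")

# T[(s, x)] maps next state -> probability, as described in the question.
T = {
    ("s1", "a"): {"s1": F(1)},
    ("s2", "a"): {"s1": F(1)},
    ("s1", "b"): {"s2": F(1)},
    ("s2", "b"): {"s2": F(1)},
    ("s1", "c"): {"s1": F(1, 2), "s2": F(1, 2)},
    ("s2", "c"): {"s1": F(1, 2), "s2": F(1, 2)},
}

# R[(s, x)] is the reward; it depends only on the starting state and action.
R = {
    ("s1", "a"): F(1), ("s2", "a"): F(1),
    ("s1", "b"): F(2), ("s2", "b"): F(2),
    ("s1", "c"): F(3), ("s2", "c"): F(2),
}


def evaluate(policy):
    """Solve (I - gamma * P) V = r for a two-state policy by Cramer's rule."""
    p = {s: T[(s, policy[s])] for s in STATES}
    r = {s: R[(s, policy[s])] for s in STATES}
    a11 = 1 - GAMMA * p["s1"].get("s1", 0)
    a12 = -GAMMA * p["s1"].get("s2", 0)
    a21 = -GAMMA * p["s2"].get("s1", 0)
    a22 = 1 - GAMMA * p["s2"].get("s2", 0)
    det = a11 * a22 - a12 * a21
    v1 = (r["s1"] * a22 - a12 * r["s2"]) / det
    v2 = (a11 * r["s2"] - a21 * r["s1"]) / det
    return {"s1": v1, "s2": v2}


def q_value(s, x, v):
    """Q(s, x) = R(s, x) + gamma * sum over s' of T(s, x, s') * V(s')."""
    return R[(s, x)] + GAMMA * sum(prob * v[nxt] for nxt, prob in T[(s, x)].items())


def improving_actions(policy):
    """Assumed definition: actions whose Q-value strictly exceeds V(s)."""
    v = evaluate(policy)
    return {s: [x for x in ACTIONS if q_value(s, x, v) > v[s]] for s in STATES}


policy_ac = {"s1": "a", "s2": "c"}
print(evaluate(policy_ac))
print(improving_actions(policy_ac))
```

Under these assumptions, the policy's values are V(s₁) = 4 and V(s₂) = 28/5 = 5.6. From s₁, Q(s₁, a) = 4, Q(s₁, b) = 6.2, and Q(s₁, c) = 6.6, so IA(ac, s₁) = {b, c}. From s₂, Q(s₂, a) = 4, Q(s₂, b) = 6.2, and Q(s₂, c) = 5.6, so IA(ac, s₂) = {b}. Exact fractions are used so that the strict comparison against V(s) is not affected by floating-point rounding.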