77. In a Markov decision problem, another criterion often used, different from the expected average return per unit time, is that of the expected discounted return. In this criterion we choose a number $\alpha$, $0 < \alpha < 1$, and try to choose a policy so as to maximize

$$E\left[\sum_{n=0}^{\infty} \alpha^{n} R(X_n, a_n)\right]$$

(that is, rewards at time $n$ are discounted at rate $\alpha^{n}$). Suppose that the initial state is chosen according to the probabilities $b_i$. That is,

$$P\{X_0 = i\} = b_i, \qquad i = 1, \ldots, n$$

For a given policy $\beta$, let $y_{ja}$ denote the expected discounted time that the process is in state $j$ and action $a$ is chosen. That is,

$$y_{ja} = E_{\beta}\left[\sum_{n=0}^{\infty} \alpha^{n} I_{\{X_n = j,\, a_n = a\}}\right]$$

where for any event $A$ the indicator variable $I_A$ is defined by

$$I_A = \begin{cases} 1, & \text{if } A \text{ occurs} \\ 0, & \text{otherwise} \end{cases}$$
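The expected discounted occupation times defined above can be estimated by simulation. A minimal Monte Carlo sketch on a made-up 2-state, 2-action example (all transition probabilities, the randomized policy, and the discount factor $\alpha = 0.9$ below are hypothetical illustration values, not from the text):

```python
import random

alpha = 0.9                      # discount factor, 0 < alpha < 1
b = [0.5, 0.5]                   # initial-state probabilities b_i
# P[i][a][j] = probability of moving from state i to state j under action a
P = [[[0.7, 0.3], [0.2, 0.8]],
     [[0.4, 0.6], [0.9, 0.1]]]
beta = [[0.6, 0.4], [0.3, 0.7]]  # beta[i][a]: prob. of choosing action a in state i

def estimate_y(n_runs=2000, horizon=100, seed=0):
    """Monte Carlo estimate of y_{ja} = E_beta[sum_n alpha^n I{X_n=j, a_n=a}]."""
    rng = random.Random(seed)
    y = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(n_runs):
        x = rng.choices([0, 1], weights=b)[0]    # draw X_0 from b
        disc = 1.0                               # current discount alpha^n
        for _ in range(horizon):
            a = rng.choices([0, 1], weights=beta[x])[0]
            y[x][a] += disc
            x = rng.choices([0, 1], weights=P[x][a])[0]
            disc *= alpha
    return [[y[j][a] / n_runs for a in range(2)] for j in range(2)]

y = estimate_y()
total = sum(sum(row) for row in y)
# Exactly one (state, action) pair occurs at each step, so the total is the
# truncated geometric series sum_{n=0}^{horizon-1} alpha^n in every run.
```

Because the indicator variables partition each time step, `total` equals the truncated geometric sum exactly, regardless of the random seed; this foreshadows the first identity of part (b).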

(a) Show that

$$\sum_{a} y_{ja} = E\left[\sum_{n=0}^{\infty} \alpha^{n} I_{\{X_n = j\}}\right]$$

or, in other words, $\sum_{a} y_{ja}$ is the expected discounted time in state $j$ under $\beta$.

(b) Show that

$$\sum_{j} \sum_{a} y_{ja} = \frac{1}{1-\alpha}, \qquad \sum_{a} y_{ja} = b_j + \alpha \sum_{i} \sum_{a} y_{ia} P_{ij}(a) \tag{4.38}$$

Hint: For the second equation, use the identity

$$I_{\{X_{n+1} = j\}} = \sum_{i} \sum_{a} I_{\{X_n = i,\, a_n = a\}}\, I_{\{X_{n+1} = j\}}$$

Take expectations of the preceding to obtain

$$E\left[I_{\{X_{n+1} = j\}}\right] = \sum_{i} \sum_{a} E\left[I_{\{X_n = i,\, a_n = a\}}\right] P_{ij}(a)$$
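Equations (4.38) can be checked numerically for a fixed stationary policy: under such a policy $y_{ja} = \beta_j(a)\, v_j$, where $v_j = \sum_a y_{ja}$ solves the linear system $v_j = b_j + \alpha \sum_i v_i P^{\beta}_{ij}$. A sketch on a made-up 2-state, 2-action example (all numbers hypothetical; numpy assumed available):

```python
import numpy as np

alpha = 0.9
b = np.array([0.5, 0.5])
P = np.array([[[0.7, 0.3], [0.2, 0.8]],     # P[i, a, j]
              [[0.4, 0.6], [0.9, 0.1]]])
beta = np.array([[0.6, 0.4], [0.3, 0.7]])    # beta[i, a]

# One-step transition matrix under the stationary policy beta:
# Pbeta[i, j] = sum_a beta[i, a] * P[i, a, j]
Pbeta = np.einsum('ia,iaj->ij', beta, P)

# v = b + alpha * Pbeta^T v  <=>  (I - alpha * Pbeta^T) v = b
v = np.linalg.solve(np.eye(2) - alpha * Pbeta.T, b)
y = beta * v[:, None]                        # y[j, a] = beta_j(a) * v_j

# First identity of (4.38): total discounted time is 1/(1 - alpha).
print(y.sum(), 1 / (1 - alpha))
```

Summing the balance equation over $j$ recovers the first identity, since $\sum_j P^{\beta}_{ij} = 1$ and $\sum_j b_j = 1$; the code's `y.sum()` reflects this.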

(c) Let $\{y_{ja}\}$ be a set of numbers satisfying

$$\sum_{j} \sum_{a} y_{ja} = \frac{1}{1-\alpha}, \qquad \sum_{a} y_{ja} = b_j + \alpha \sum_{i} \sum_{a} y_{ia} P_{ij}(a)$$

Argue that $y_{ja}$ can be interpreted as the expected discounted time that the process is in state $j$ and action $a$ is chosen when the initial state is chosen according to the probabilities $b_j$ and the policy $\beta$, given by

$$\beta_j(a) = \frac{y_{ja}}{\sum_{a} y_{ja}},$$

is employed.

Hint: Derive a set of equations for the expected discounted times when policy $\beta$ is used and show that they are equivalent to Eq. (4.38).

(d) Argue that an optimal policy with respect to the expected discounted return criterion can be obtained by first solving the linear program

$$\begin{aligned} \text{maximize} \quad & \sum_{j} \sum_{a} y_{ja} R(j, a) \\ \text{subject to} \quad & \sum_{j} \sum_{a} y_{ja} = \frac{1}{1-\alpha}, \\ & \sum_{a} y_{ja} = b_j + \alpha \sum_{i} \sum_{a} y_{ia} P_{ij}(a), \\ & y_{ja} \geq 0, \quad \text{all } j, a \end{aligned}$$

and then defining the policy $\beta^{*}$ by

$$\beta^{*}_j(a) = \frac{y^{*}_{ja}}{\sum_{a} y^{*}_{ja}}$$

where the $y^{*}_{ja}$ are the solutions of the linear program.
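The linear program in part (d) can be set up directly with an off-the-shelf solver. A minimal sketch, assuming `scipy` is available and using made-up transition probabilities and rewards on a 2-state, 2-action example (the constraint $\sum_j \sum_a y_{ja} = 1/(1-\alpha)$ is implied by the balance equations, so only those are passed to the solver):

```python
import numpy as np
from scipy.optimize import linprog

alpha, nS, nA = 0.9, 2, 2
b = np.array([0.5, 0.5])
P = np.array([[[0.7, 0.3], [0.2, 0.8]],     # P[i, a, j]
              [[0.4, 0.6], [0.9, 0.1]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])      # R[j, a]: reward in state j, action a

# Variables y_{ja}, flattened as y[j*nA + a].  linprog minimizes, so negate
# the objective sum_{j,a} y_{ja} R(j, a).
c = -R.reshape(-1)

# Balance constraints: sum_a y_{ja} - alpha * sum_{i,a} y_{ia} P_{ij}(a) = b_j
A_eq = np.zeros((nS, nS * nA))
for j in range(nS):
    for i in range(nS):
        for a in range(nA):
            A_eq[j, i * nA + a] = (i == j) - alpha * P[i, a, j]

res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (nS * nA))
y = res.x.reshape(nS, nA)
# beta*_j(a) = y*_{ja} / sum_a y*_{ja}; the denominators are positive
# because sum_a y_{ja} >= b_j > 0 in this example.
beta_star = y / y.sum(axis=1, keepdims=True)
```

At a basic optimal solution the LP places all mass in each state on one action, so $\beta^{*}$ is typically deterministic, which matches the known result that discounted MDPs admit optimal stationary deterministic policies.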
