Question:

Consider the following deterministic MDP with 1-dimensional continuous states and actions and a finite task horizon:

State Space S: R

Action Space A: R

Reward Function: R(s, a, s') = −q s^2 − r a^2, where r > 0 and q ≥ 0

Deterministic Dynamics/Transition Function: s' = c s + d a (i.e., the next state s' is a deterministic function of the current state s and the action a)

Task Horizon: T ∈ N

Discount Factor: γ = 1 (no discounting)

Hence, we would like to maximize a quadratic reward function that rewards small actions and staying close to the origin. In this problem, we will design an optimal agent π_t and also solve for the optimal agent's value function V_t for all timesteps.

By induction, we will show that V_t is quadratic. Observe that the base case t = 0 holds trivially because V_0(s) = 0. For all parts below, assume that V_t(s) = −p_t s^2 (Inductive Hypothesis).

a. (i) Write the equation for V_{t+1}(s) as a function of s, q, r, a, c, d, and p_t. If your expression contains a max, you do not need to simplify it.
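As a sketch of the form this takes (one plausible write-up, not necessarily the intended one): with γ = 1, the inductive hypothesis V_t(s) = −p_t s^2, and the dynamics s' = c s + d a, the Bellman backup gives

V_{t+1}(s) = max_a [ R(s, a, s') + V_t(s') ] = max_a [ −q s^2 − r a^2 − p_t (c s + d a)^2 ].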

(ii) Now, solve for π_{t+1}(s). Recall that you can find local maxima of functions by computing the first derivative and setting it to 0.
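A hedged sketch of this step: the objective −q s^2 − r a^2 − p_t (c s + d a)^2 is concave in a, since r > 0 and p_t ≥ 0 (the latter holds by induction, because p_0 = 0, q ≥ 0, and r > 0). Setting the derivative with respect to a to zero,

−2 r a − 2 p_t d (c s + d a) = 0,

and solving for a gives

π_{t+1}(s) = a* = −( p_t c d / (r + p_t d^2) ) s,

which is linear in s, consistent with the assumption made in part b.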

b. Assume π_{t+1}(s) = k_{t+1} s for some k_{t+1} ∈ R. Solve for p_{t+1} in V_{t+1}(s) = −p_{t+1} s^2.
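One way this can play out (a sketch, not an official solution): substituting a = k_{t+1} s into the backup from part a(i) gives

V_{t+1}(s) = −q s^2 − r k_{t+1}^2 s^2 − p_t (c + d k_{t+1})^2 s^2,

so

p_{t+1} = q + r k_{t+1}^2 + p_t (c + d k_{t+1})^2,

and plugging in k_{t+1} = −p_t c d / (r + p_t d^2) from part a(ii) simplifies this to the scalar Riccati-style recursion p_{t+1} = q + p_t c^2 r / (r + p_t d^2).

The short Python sketch below checks this recursion numerically against a brute-force maximization of the backup over a dense action grid. The constants q, r, c, d and the horizon T are illustrative assumptions, not values given in the problem.

import numpy as np

# Illustrative constants (assumed for this check, not part of the problem statement).
q, r, c, d, T = 1.0, 0.5, 1.2, 0.8, 5

def backup(p, s, actions):
    # One Bellman backup at state s, given V_t(s') = -p * s'**2 and gamma = 1.
    returns = -q * s**2 - r * actions**2 - p * (c * s + d * actions)**2
    return returns.max()

actions = np.linspace(-20.0, 20.0, 40001)  # dense action grid for the brute-force max

p = 0.0  # base case: V_0(s) = 0, i.e. p_0 = 0
for t in range(1, T + 1):
    k = -p * c * d / (r + p * d**2)               # candidate optimal gain k_t
    p_next = q + r * k**2 + p * (c + d * k)**2    # candidate p_t
    # The two forms of the recursion should agree.
    assert np.isclose(p_next, q + p * c**2 * r / (r + p * d**2))
    # Compare against a brute-force maximization of the backup at a few states.
    for s in (-2.0, 0.3, 1.5):
        assert np.isclose(backup(p, s, actions), -p_next * s**2, atol=1e-4)
    p = p_next
    print(f"t={t}: k_t={k:.4f}, p_t={p:.4f}")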
