Question:

Problem 2. (30pt) Given a Markov stationary policy $\pi$, we studied the minimization of the projected Bellman error for policy evaluation via function approximation. Alternatively, we can choose the objective function as
$$ J(\theta) = \frac{1}{2} \sum_{s \in S} \mu_\pi(s) \, \big| v_\pi(s) - v(s;\theta) \big|^2, $$
where $\mu_\pi \in \Delta(S)$ is the stationary distribution of the Markov chain induced by $\pi$ and $v(\cdot;\theta)$ is the approximation of $v_\pi$ with the parameter $\theta \in \mathbb{R}^d$. Then, the gradient of $J(\theta)$ with respect to $\theta$ is given by
$$ \nabla J(\theta) = -\,\mathbb{E}_{s \sim \mu_\pi}\big[ \big( v_\pi(s) - v(s;\theta) \big) \nabla_\theta v(s;\theta) \big]. $$
To find $\theta$ approximating $v_\pi(s)$, we can apply the stochastic gradient method according to
$$ \theta_{k+1} = \theta_k + \alpha_k \big( v_\pi(s_k) - v(s_k;\theta_k) \big) \nabla_\theta v(s_k;\theta_k), \qquad (**) $$
where $\alpha_k > 0$ is the step size and $s_k \in S$ denotes the current state at stage $k$.
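For concreteness, here is a minimal Python sketch of one step of (**), assuming a linear parametrization $v(s;\theta) = \hat{\phi}(s)^T \theta$ (so that $\nabla_\theta v(s;\theta) = \hat{\phi}(s)$) and assuming some estimate of $v_\pi(s_k)$, such as a sampled return, is available as `target`. The function name and arguments are illustrative, not part of the problem.

```python
import numpy as np

def sgd_value_update(theta, phi_s, target, alpha):
    """One stochastic gradient step of (**) for a linear model v(s; theta) = phi(s)^T theta.

    theta  : parameter vector theta_k, shape (d,)
    phi_s  : feature vector phi(s_k) of the current state, shape (d,)
    target : an estimate standing in for v_pi(s_k), e.g. a sampled return
    alpha  : step size alpha_k > 0
    """
    v_hat = phi_s @ theta          # v(s_k; theta_k)
    grad = phi_s                   # grad_theta v(s_k; theta) for a linear model
    return theta + alpha * (target - v_hat) * grad

# Illustrative usage with made-up numbers (not taken from the problem):
theta = np.zeros(3)
theta = sgd_value_update(theta, phi_s=np.array([1.0, 0.0, 0.5]), target=2.0, alpha=0.1)
```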
(a) Show that with direct parametrization, i.e., $v = \theta$, the update (**) reduces to
- the n-step TD learning algorithm if we use $v_\pi(s_t) \approx G_t^{(n)}$,
- the Monte Carlo method if we use $v_\pi(s_t) \approx G_t$, where $G_t = \lim_{n \to \infty} G_t^{(n)}$,
- the $\lambda$-return update if we use $v_\pi(s_t) \approx G_t^{\lambda}$.
Recall the indicator function $I\{s_t = s\}$ in these (non-parametric) updates.
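As a rough illustration of the three targets in part (a), the following sketch computes $G_t^{(n)}$, $G_t$, and $G_t^{\lambda}$ from a recorded episode under a tabular (direct) value estimate. It assumes an episodic setting with `rewards[k]` holding $r_{k+1}$, a discount `gamma`, and a value table `V`; these conventions are illustrative assumptions, not givens of the problem.

```python
def n_step_return(rewards, states, V, t, n, gamma):
    """G_t^(n): n-step return, bootstrapped with the current value table V."""
    T = len(rewards)                           # episode length; rewards[k] stores r_{k+1}
    horizon = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
    if t + n < T:                              # bootstrap only if the episode has not ended
        G += gamma ** n * V[states[t + n]]
    return G

def monte_carlo_return(rewards, t, gamma):
    """G_t = lim_{n -> inf} G_t^(n): the full discounted return to the end of the episode."""
    return sum(gamma ** (k - t) * rewards[k] for k in range(t, len(rewards)))

def lambda_return(rewards, states, V, t, gamma, lam):
    """G_t^lambda = (1 - lam) * sum_{n=1}^{T-t-1} lam^(n-1) G_t^(n) + lam^(T-t-1) G_t."""
    T = len(rewards)
    G_lam = sum((1 - lam) * lam ** (n - 1) * n_step_return(rewards, states, V, t, n, gamma)
                for n in range(1, T - t))
    return G_lam + lam ** (T - t - 1) * monte_carlo_return(rewards, t, gamma)
```

With direct parametrization, plugging any of these targets into (**) gives a tabular update of the form $V(s_t) \leftarrow V(s_t) + \alpha_k (G - V(s_t))$, which touches only the component of $\theta$ selected by the indicator $I\{s = s_t\}$.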
(b) The direct parametrization can be viewed as linear function approximation with the feature matrix $I \in \mathbb{R}^{|S| \times |S|}$. What if we have the feature matrix
$$ \Phi = [\phi_1 \;\cdots\; \phi_d] = \begin{bmatrix} \hat{\phi}^T(s_1) \\ \vdots \\ \hat{\phi}^T(s_{|S|}) \end{bmatrix} \in \mathbb{R}^{|S| \times d}, $$
where $\phi_i \in \mathbb{R}^{|S|}$ and $\hat{\phi}(s) \in \mathbb{R}^d$? We have $d \le |S|$ and $\Phi$ is full column rank. Formulate the counterparts of the n-step TD learning, Monte Carlo, and $\lambda$-return algorithms based on (**) under linear function approximation according to the feature matrix $\Phi$.
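Purely as a hedged sketch of what one part (b) counterpart might look like (the exact formulation is what the problem asks you to derive), a semi-gradient n-step TD step under the linear parametrization $v(s;\theta) = \hat{\phi}(s)^T\theta$ replaces $v_\pi(s_t)$ in (**) with an n-step return bootstrapped through the features. Here `feat` is an assumed callable returning the feature vector $\hat{\phi}(s)$ as a NumPy array; the trajectory conventions are the same illustrative ones used above.

```python
def linear_n_step_td_update(theta, feat, rewards, states, t, n, gamma, alpha):
    """One update of (**) with v(s; theta) = feat(s)^T theta and v_pi(s_t) ~ G_t^(n)."""
    T = len(rewards)                          # rewards[k] stores r_{k+1}
    horizon = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
    if t + n < T:
        G += gamma ** n * (feat(states[t + n]) @ theta)   # bootstrap with the current theta
    phi_t = feat(states[t])
    return theta + alpha * (G - phi_t @ theta) * phi_t
```

Replacing $G$ with the full Monte Carlo return or the $\lambda$-return from part (a) would give the other two counterparts; only the target changes, while the gradient term $\hat{\phi}(s_t)$ stays the same.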