Problem 1. (50pt) Given a Markov stationary policy $\pi$, consider the policy evaluation problem to compute $v^\pi$. For example, we can apply the temporal difference (TD) learning algorithm given by

$$v_{t+1}(s) = v_t(s) + \alpha\,\delta_t \cdot \mathbb{I}\{s_t = s\},$$

where $\delta_t := r_t + \gamma v_t(s_{t+1}) - v_t(s_t)$ is known as the TD error. Alternatively, we can apply the n-step TD learning algorithm given by

$$v_{t+1}(s) = v_t(s) + \alpha\bigl(G_t^{(n)} - v_t(s)\bigr) \cdot \mathbb{I}\{s_t = s\},$$

where $G_t^{(n)} := r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n v_t(s_{t+n})$ for $n = 1, 2, \ldots$. Note that $\delta_t = G_t^{(1)} - v_t(s_t)$.
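To make the two update rules concrete, here is a minimal tabular sketch of one-step TD policy evaluation together with a helper that computes the n-step return. The episode format, the fixed step size `alpha`, and the function names are illustrative assumptions, not part of the problem statement.

```python
import numpy as np

def td0_policy_evaluation(episodes, n_states, alpha=0.1, gamma=0.99):
    """One-step TD: v(s_t) <- v(s_t) + alpha * delta_t, updating only the visited state.

    `episodes` is assumed to be a list of trajectories, each a list of
    (s_t, r_t, s_{t+1}, done) tuples collected by following the fixed policy.
    """
    v = np.zeros(n_states)
    for episode in episodes:
        for s, r, s_next, done in episode:
            # TD error: delta_t = r_t + gamma * v_t(s_{t+1}) - v_t(s_t);
            # the bootstrap term is dropped at termination.
            delta = r + (0.0 if done else gamma * v[s_next]) - v[s]
            v[s] += alpha * delta
    return v

def n_step_return(rewards, v, s_end, gamma, terminated=False):
    """G_t^(n) = r_t + gamma*r_{t+1} + ... + gamma^{n-1}*r_{t+n-1} + gamma^n * v(s_{t+n})."""
    g = sum(gamma**k * r for k, r in enumerate(rewards))
    if not terminated:
        g += gamma**len(rewards) * v[s_end]
    return g
```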
The n-step TD algorithms for finite $n$ use bootstrapping. Therefore, they use a biased estimate of $v^\pi$. On the other hand, as $n \to \infty$, the n-step TD algorithm becomes a Monte Carlo method, where we use an unbiased estimate of $v^\pi$. However, these approaches delay the update for $n$ stages, and we update the value function estimate only for the current state.
As an intermediate step to address these challenges, we first introduce the $\lambda$-return algorithm given by

$$v_{t+1}(s) = v_t(s) + \alpha\bigl(G_t^{\lambda} - v_t(s)\bigr) \cdot \mathbb{I}\{s_t = s\},$$

where, given $\lambda \in [0,1]$, we define $G_t^{\lambda} := (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$, taking a weighted average of the $G_t^{(n)}$'s.
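As an illustration, the weighted average defining $G_t^{\lambda}$ can be computed directly for a finite episode. The sketch below computes $G_0^{\lambda}$ under the assumption that the episode terminates after $T$ steps, so every $G_0^{(n)}$ with $n \geq T$ equals the full Monte Carlo return and absorbs the remaining geometric weight $\lambda^{T-1}$; the trajectory format and variable names are assumptions for illustration only.

```python
def lambda_return(rewards, values, gamma, lam):
    """Compute G_0^lambda = (1 - lam) * sum_{n>=1} lam^{n-1} * G_0^(n) for a finite episode.

    rewards[k] is r_k for k = 0..T-1, and values[k] is the current estimate v(s_k);
    the terminal value is taken to be zero.
    """
    T = len(rewards)
    # n-step returns G_0^(1), ..., G_0^(T); the last one is the Monte Carlo return.
    g_n = []
    for n in range(1, T + 1):
        g = sum(gamma**k * rewards[k] for k in range(n))
        if n < T:
            g += gamma**n * values[n]          # bootstrap with v(s_n)
        g_n.append(g)
    # (1 - lam) times the first T-1 weighted n-step returns, plus the geometric
    # tail weight lam^(T-1) assigned to the Monte Carlo return G_0^(T).
    out = (1.0 - lam) * sum(lam**(n - 1) * g_n[n - 1] for n in range(1, T))
    out += lam**(T - 1) * g_n[-1]
    return out
```

Setting `lam = 0` recovers the one-step return $G_0^{(1)}$, and `lam = 1` recovers the Monte Carlo return, matching the limits discussed above.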
(a) By the definition of $G_t^{(n)}$, we can show that $G_t^{(n)} = r_t + \gamma G_{t+1}^{(n-1)}$. Derive an analogous recursive relationship for $G_t^{\lambda}$ and $G_{t+1}^{\lambda}$.
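For reference, the stated n-step recursion follows directly from the definition of $G_t^{(n)}$ by factoring out one $\gamma$ (treating the value estimate as held fixed at $v_t$ across the step, which is how the identity is usually read):

$$G_t^{(n)} = r_t + \gamma\bigl(r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-2} r_{t+n-1} + \gamma^{n-1} v_t(s_{t+n})\bigr) = r_t + \gamma\,G_{t+1}^{(n-1)}.$$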
(b) Show that the term $G_t^{\lambda} - v_t(s)$ in the $\lambda$-return update can be written as the sum of TD errors.
The TD algorithm, the Monte Carlo method, and the $\lambda$-return algorithm look forward to approximate $v^\pi$. Alternatively, we can look backward via the eligibility trace method, TD($\lambda$).
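The backward view is typically implemented with an eligibility trace vector that is updated at every step. The sketch below shows the standard accumulating-trace TD($\lambda$) update; since the problem's own definition of TD($\lambda$) is not reproduced above, treat this as an illustrative assumption rather than the problem's exact formulation.

```python
import numpy as np

def td_lambda(episodes, n_states, alpha=0.1, gamma=0.99, lam=0.9):
    """TD(lambda) with accumulating eligibility traces (backward view).

    Every visited state keeps a decaying trace e(s); the TD error delta_t
    updates all states in proportion to their traces, not just s_t.
    """
    v = np.zeros(n_states)
    for episode in episodes:
        e = np.zeros(n_states)                 # eligibility traces, reset per episode
        for s, r, s_next, done in episode:
            delta = r + (0.0 if done else gamma * v[s_next]) - v[s]
            e *= gamma * lam                   # decay all traces
            e[s] += 1.0                        # accumulate trace for the current state
            v += alpha * delta * e             # credit propagates backward along the trace
    return v
```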