Question:

Problem 1. (50pt) Given a Markov stationary policy $\pi$, consider the policy evaluation problem to compute $v^{\pi}$. For example, we can apply the temporal difference (TD) learning algorithm given by $v_{t+1}(s) = v_t(s) + \alpha\,\delta_t\,\mathbb{I}\{s_t = s\}$, where $\delta_t := r_t + \gamma v_t(s_{t+1}) - v_t(s_t)$ is known as the TD error. Alternatively, we can apply the $n$-step TD learning algorithm given by $v_{t+1}(s) = v_t(s) + \alpha\,\bigl(G_t^{(n)} - v_t(s)\bigr)\,\mathbb{I}\{s_t = s\}$, where $G_t^{(n)} := r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} v_t(s_{t+n})$ for $n = 1, 2, \ldots$. Note that $\delta_t = G_t^{(1)} - v_t(s_t)$. The n-step TD algorithms for n
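To make the two update rules concrete, here is a minimal Python sketch of tabular n-step TD applied to one recorded trajectory. It is an illustration under stated assumptions, not a solution to the problem: the function names `n_step_return` and `n_step_td`, the step size `alpha`, and the discount factor `gamma` are hypothetical labels chosen here, unseen states are assumed to start at value 0, and the table is updated in place along the trajectory, so the bootstrap term uses the latest estimate rather than a frozen $v_t$. Setting `n = 1` gives the one-step TD update.

```python
from typing import Dict, Hashable, Optional, Sequence


def n_step_return(rewards: Sequence[float], bootstrap: float, gamma: float) -> float:
    """Return r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1} + gamma^n * bootstrap,
    where n = len(rewards) and bootstrap stands for the estimate v(s_{t+n})."""
    n = len(rewards)
    discounted = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return discounted + (gamma ** n) * bootstrap


def n_step_td(
    states: Sequence[Hashable],
    rewards: Sequence[float],
    n: int,
    alpha: float,
    gamma: float,
    v: Optional[Dict[Hashable, float]] = None,
) -> Dict[Hashable, float]:
    """One pass of tabular n-step TD over a recorded trajectory.

    states holds s_0, ..., s_T and rewards[t] is r_t, so len(states) == len(rewards) + 1.
    Only time steps t with t + n <= T are updated (assumption: no truncated
    returns near the end of the trajectory).
    """
    v = dict(v) if v is not None else {}  # copy so the caller's table is untouched
    T = len(rewards)
    for t in range(T - n + 1):
        s, s_boot = states[t], states[t + n]
        g = n_step_return(rewards[t:t + n], v.get(s_boot, 0.0), gamma)
        old = v.get(s, 0.0)
        # Only the visited state s_t changes, which is the role of the indicator.
        v[s] = old + alpha * (g - old)
    return v


if __name__ == "__main__":
    trajectory_states = ["A", "B", "C"]
    trajectory_rewards = [1.0, 2.0]
    print(n_step_td(trajectory_states, trajectory_rewards, n=1, alpha=0.5, gamma=0.9))
    print(n_step_td(trajectory_states, trajectory_rewards, n=2, alpha=0.5, gamma=0.9))
```

With `n = 1` the target for state `A` bootstraps from the current estimate of `B`, whereas with `n = 2` it uses both observed rewards before bootstrapping from `C`.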
