Question:
At state $x$, with probability 1 the state transitions to $y_1$, i.e.,
$$P(y_1 \mid x) = 1.$$
Then at state $y_1$, we have
$$P(y_1 \mid y_1) = p, \quad P(y_2 \mid y_1) = 1 - p,$$
which says there is probability $p$ that we stay in $y_1$ and probability $1 - p$ that the state transitions to $y_2$. Finally, state $y_2$ is the absorbing state, so that
$$P(y_2 \mid y_2) = 1.$$
The instant reward is set to 1 for starting in state $y_1$ and 0 elsewhere:
$$R(y_1, a, y_1) = 1, \quad R(y_1, a, y_2) = 1, \quad R(s, a, s') = 0 \text{ otherwise}.$$
The discount factor is denoted by $\gamma$ ($0 < \gamma < 1$).
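For a numerical sanity check, the chain described above can be written out as a small table and solved by value iteration. This is only a sketch: the values `p=0.5` and `gamma=0.9`, the function name `value_iteration`, and the dictionary layout are illustrative choices, and it assumes there is a single action `a` in every state (so the action is dropped and Q(x,a) coincides with the value of x).

```python
def value_iteration(p, gamma, tol=1e-12, max_iter=100_000):
    """Iterate the Bellman backup on the three-state chain x -> y1 -> y2."""
    states = ["x", "y1", "y2"]
    # P[s] maps each reachable next state to its probability.
    # Assumption: one action per state, so the action index is omitted.
    P = {
        "x": {"y1": 1.0},
        "y1": {"y1": p, "y2": 1.0 - p},
        "y2": {"y2": 1.0},  # absorbing state
    }

    def reward(s, s_next):
        # Reward is 1 whenever the transition starts in y1, else 0.
        return 1.0 if s == "y1" else 0.0

    V = {s: 0.0 for s in states}
    for _ in range(max_iter):
        # Bellman backup: expectation over next states of reward + discounted value.
        V_new = {
            s: sum(prob * (reward(s, s2) + gamma * V[s2])
                   for s2, prob in P[s].items())
            for s in states
        }
        delta = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if delta < tol:
            break
    return V


# Illustrative parameter values, not taken from the question.
V = value_iteration(p=0.5, gamma=0.9)
print("V(y1) =", V["y1"])
print("Q(x,a) = V(x) =", V["x"])
```

The sum inside the backup is where `p` and `1 - p` enter: each possible next state contributes its probability times (reward plus discounted value of that next state).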
My problem is defining this with $p$ and $1-p$; it confuses me. I know how to do Bellman equations when they involve the usual $T$, $R$ and $V^*$.
This is the question:
Define $V^*(y_1)$ as the optimal value function of the state $y_1$. Compute $V^*(y_1)$ via Bellman's equation. (The answer is a formula in terms of $\gamma$, $p$.)
$V^*(y_1) =$
Find $Q^*(x, a)$.
$Q^*(x, a) =$
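One possible way to set this up, as a sketch rather than an official solution: the Bellman equation takes an expectation over the next state, so $p$ and $1-p$ simply play the role of the usual transition probabilities $T$. It assumes a single action $a$ in every state, so no maximization is needed. Since $y_2$ is absorbing with zero reward, $V^*(y_2) = \gamma V^*(y_2)$, which gives $V^*(y_2) = 0$ for $\gamma < 1$.

```latex
\begin{align*}
V^*(y_1) &= p\,\bigl[R(y_1,a,y_1) + \gamma V^*(y_1)\bigr]
          + (1-p)\,\bigl[R(y_1,a,y_2) + \gamma V^*(y_2)\bigr] \\
         &= p\,\bigl[1 + \gamma V^*(y_1)\bigr] + (1-p)\,\bigl[1 + \gamma \cdot 0\bigr] \\
         &= 1 + \gamma p\, V^*(y_1)
\quad\Longrightarrow\quad
V^*(y_1) = \frac{1}{1 - \gamma p}, \\[6pt]
Q^*(x,a) &= P(y_1 \mid x)\,\bigl[R(x,a,y_1) + \gamma V^*(y_1)\bigr]
          = 0 + \gamma V^*(y_1)
          = \frac{\gamma}{1 - \gamma p}.
\end{align*}
```

As a quick check on the first result, $1/(1-\gamma p)$ is the geometric series $\sum_{k \ge 0} (\gamma p)^k$: a reward of 1 is collected at step $k$ only if the chain has stayed in $y_1$ for $k$ steps, which has probability $p^k$ and is discounted by $\gamma^k$.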
