Question:

a) Consider an MDP with a single nonterminal state and a single action that transitions back to the nonterminal state with probability p and to the terminal state with probability 1 − p. Let the reward be +1 on all transitions, and let γ = 1. Suppose you observe one episode that lasts 10 steps, with a return of 10. What are the first-visit and every-visit estimators of the value of the nonterminal state?
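The two estimators for part a) can be checked directly from the one observed episode: with 10 steps of +1 reward and γ = 1, the nonterminal state is visited at times t = 0 through 9, with returns G_t = 10, 9, ..., 1. A minimal sketch of the computation:

```python
# Monte Carlo estimates of V(nonterminal) from the single observed episode:
# 10 steps, reward +1 per transition, gamma = 1.
gamma = 1.0
rewards = [1.0] * 10  # +1 on every transition

# Compute the return G_t from each visit by accumulating backwards.
returns = []
G = 0.0
for r in reversed(rewards):
    G = r + gamma * G
    returns.append(G)
returns.reverse()  # returns[t] = G_t for the visit at time t

# First-visit MC: only the return following the first visit counts.
first_visit_estimate = returns[0]
# Every-visit MC: average the returns following all 10 visits.
every_visit_estimate = sum(returns) / len(returns)

print(first_visit_estimate)   # 10.0
print(every_visit_estimate)   # (10 + 9 + ... + 1) / 10 = 5.5
```

So the first-visit estimate is 10, while the every-visit estimate averages the ten returns 10 down to 1, giving 5.5.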
b) What is the analogous equation for action values Q(s, a) instead of state values V(s), again given returns generated using the behavior policy b?
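For the action-value question, assuming the state-value estimator being referred to is the ordinary importance-sampling form $V(s) \doteq \sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t \,/\, |\mathcal{T}(s)|$ (notation taken from that context: $\mathcal{T}(s,a)$ is the set of time steps at which the pair $(s,a)$ is visited, $T(t)$ is the termination time of the episode containing $t$, and $\rho$ is the importance-sampling ratio), a sketch of the analogue is:

```latex
Q(s,a) \doteq \frac{\sum_{t \in \mathcal{T}(s,a)} \rho_{t+1:T(t)-1}\, G_t}{|\mathcal{T}(s,a)|}
```

The ratio starts at $t+1$ rather than $t$ because the first action $a$ is given, so its probability under the behavior policy b does not need to be corrected.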
