Question: Consider a reinforcement learning setup in which the agent can take two actions, a ∈ {0, 1}. There are two states, s ∈ {0, 1}, and there is no discounting (gamma = 1). Over an episode of three time steps, the agent has visited the sequence of state-action pairs {(1,1), (0,1), (0,0)}. The associated rewards have been {1, -1, 1}. Our previous estimate of the value of the state-action pair (1, 0) is Q(1, 0) = 0.125, and we are in the second episode. We follow "first-visit" Monte Carlo. Given the new experience from the episode, we would have that (choose one of the options below):
a) Q(1,0) = 0
b) Q(1,0) = 0.25
c) Q(1,0) = 0.125
d) none of the above
Step by Step Solution
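One way to check the options is to run a first-visit Monte Carlo update on the episode from the question. The sketch below is a minimal illustration, not a reference implementation: the function name `first_visit_mc_update`, the incremental-average update, and the zero initial values and visit counts for every pair other than (1, 0) are assumptions, since the question does not state them. Under first-visit Monte Carlo only the pairs that occur in the episode are updated, and (1, 0) does not occur in {(1,1), (0,1), (0,0)}, so its estimate should be left at 0.125, which corresponds to option c).

```python
from collections import defaultdict


def first_visit_mc_update(q, counts, episode, rewards, gamma=1.0):
    """Average first-visit returns into q in place.

    Returns a dict mapping each visited (state, action) pair to the
    return that followed its first visit in this episode.
    """
    # Compute the return G_t for every time step, working backward.
    returns = [0.0] * len(episode)
    g = 0.0
    for t in reversed(range(len(episode))):
        g = rewards[t] + gamma * g
        returns[t] = g

    first_returns = {}
    for t, pair in enumerate(episode):
        if pair in first_returns:
            continue  # first-visit: ignore later visits to the same pair
        first_returns[pair] = returns[t]
        counts[pair] += 1
        # Incremental mean of the returns observed so far (assumed update rule).
        q[pair] += (returns[t] - q[pair]) / counts[pair]
    return first_returns


# Values from the question; entries other than Q(1, 0) are assumed to start at 0.
q = defaultdict(float)
q[(1, 0)] = 0.125
counts = defaultdict(int)

episode = [(1, 1), (0, 1), (0, 0)]
rewards = [1, -1, 1]

first_returns = first_visit_mc_update(q, counts, episode, rewards, gamma=1.0)

print(first_returns)
print("Q(1,0) =", q[(1, 0)])
```

With gamma = 1 the returns are G = 1 for (1,1), G = 0 for (0,1), and G = 1 for (0,0). None of them touches Q(1, 0), whatever prior counts are assumed for the other pairs.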
