Question: Consider a reinforcement learning setup in which the agent can take two actions, a ∈ {0, 1}. There are two states, s ∈ {0, 1}, and there is no discounting (gamma = 1). Over an episode of three time steps, the agent has visited the sequence of state-action pairs {(1,1), (0,1), (0,0)}, with associated rewards {1, -1, 1}. Our previous estimate of the value of the state-action pair (1, 0) is Q(1, 0) = 0.125, and we are in the second episode. Using Monte Carlo updating, what is Q(0,0)?
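One way to work through the question is sketched below in Python. The returns follow directly from the stated rewards with gamma = 1, but the update itself rests on two assumptions that the question does not confirm. First, it gives a previous estimate only for Q(1, 0), so the sketch assumes 0.125 is meant as the prior for Q(0, 0). Second, it assumes "second episode" means the step size is 1/N with N = 2 (incremental sample averaging). The function names `returns_from_rewards` and `mc_update` are hypothetical helpers introduced for illustration.

```python
def returns_from_rewards(rewards, gamma=1.0):
    """Compute the return G_t for every time step by sweeping backwards."""
    g = 0.0
    out = []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]


def mc_update(q_old, g, n):
    """Incremental sample-average Monte Carlo update after the n-th return."""
    return q_old + (g - q_old) / n


# Episode data taken from the question.
episode = [(1, 1), (0, 1), (0, 0)]
rewards = [1, -1, 1]

returns = returns_from_rewards(rewards, gamma=1.0)
g_by_pair = dict(zip(episode, returns))  # each pair occurs once in this episode

# ASSUMPTION: the prior 0.125 is intended for Q(0, 0); the question states it for Q(1, 0).
q_prior = 0.125
# ASSUMPTION: second episode, so this is the 2nd return averaged (step size 1/2).
n_visits = 2

q_00 = mc_update(q_prior, g_by_pair[(0, 0)], n_visits)
print(returns)  # returns for (1,1), (0,1), (0,0)
print(q_00)
```

Because each state-action pair appears only once in the episode, first-visit and every-visit Monte Carlo give the same returns: 1 for (1,1), 0 for (0,1), and 1 for (0,0). Under the two assumptions above, the update gives Q(0,0) = 0.125 + (1 - 0.125)/2 = 0.5625. If the intended prior for Q(0,0) or the step size differs, only `q_prior` and `n_visits` need to change.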
