Question: Consider a reinforcement learning setup in which the agent can take two actions, a ∈ {0, 1}. There are two states, s ∈ {0, 1}, and there is no discounting (gamma = 1). Over an episode of three time steps, the agent visited the sequence of state-action pairs {(0, 0), (0, 1), (1, 0)}, with associated rewards {0, -2, 1}. Our previous estimate for the state-action pair (0, 0) is Q(0, 0) = 0.1, and we are in the second episode. Using Monte Carlo updating, what is Q(0, 0)?
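A minimal sketch of the calculation follows. It assumes the sample-average Monte Carlo update Q ← Q + (1/N)(G − Q), and it reads "second episode" as meaning that (0, 0) is now being visited for the second time overall, so N = 2. Because (0, 0) appears only once in this episode, first-visit and every-visit Monte Carlo give the same result. The function names `mc_return` and `mc_update` are illustrative only.

```python
def mc_return(rewards, gamma=1.0):
    """Return G = r_1 + gamma*r_2 + gamma^2*r_3 + ..., accumulated backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def mc_update(q_old, g, n):
    """Incremental sample-average update: Q <- Q + (1/n) * (G - Q)."""
    return q_old + (g - q_old) / n


# Rewards received from (0, 0) onward: after (0, 0), (0, 1), (1, 0).
rewards = [0, -2, 1]
G = mc_return(rewards, gamma=1.0)  # 0 + (-2) + 1 = -1

# Assumption: this is the 2nd visit to (0, 0), so the step size is 1/2.
q_new = mc_update(0.1, G, n=2)

print(G, round(q_new, 4))
```

Under these assumptions G = -1 and Q(0, 0) = 0.1 + (1/2)(-1 - 0.1) = -0.45, which is simply the average of the old estimate and the new return. If a constant step size alpha is used instead of 1/N, the update becomes Q(0, 0) = 0.1 + alpha(-1.1), so the answer depends on alpha.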
