Question: You are given an environment with one state, X, and two actions, b and c. T is the terminal
state. Your Temporal Difference (TD) algorithm generates the following episode using the policy π when
interacting with its environment:
Timestep  Reward  State  Action
                  X      b
                  X      c
                  X      b
                  T
The policy is given by: π(b|X) and π(c|X)
The current values of q are: q(X, b) and q(X, c)
The discount factor, γ
The step size, α
Show the values of q(X, b) and q(X, c) after their first update using the following approaches:
(a) …-step SARSA
(b) …-step SARSA
(c) …-step Full Tree Backup
Note: You should update q(X, b) and q(X, c) only once per learning algorithm. Show your work and carry
out your calculations to two decimal places.
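
The update rules named in the question can be sketched in Python. This is only an illustration under stated assumptions: the rewards, policy probabilities, initial q values, γ, α and the step counts n are not available here, so every number below is a hypothetical placeholder, and the helper names (n_step_sarsa_return, tree_backup_return, first_updates) are made up for this sketch. Substitute the values from the original problem before reading anything into the output.

```python
# All numbers are HYPOTHETICAL placeholders, not the values from the question.
GAMMA = 0.9                      # discount factor (assumed)
ALPHA = 0.5                      # step size (assumed)
PI = {"b": 0.5, "c": 0.5}        # policy pi(a|X) (assumed)
Q0 = {"b": 1.0, "c": 2.0}        # initial q(X, a) (assumed)
# Episode as (action, reward received after the action); every state is X,
# and the last transition ends in the terminal state T. Rewards are assumed.
EPISODE = [("b", 1.0), ("c", 2.0), ("b", 3.0)]


def n_step_sarsa_return(t, n, q):
    """n-step SARSA target: discounted rewards plus a bootstrapped q value."""
    T = len(EPISODE)
    end = min(t + n, T)
    g = sum(GAMMA ** (k - t) * EPISODE[k][1] for k in range(t, end))
    if t + n < T:  # bootstrap only if the n-step horizon stops before T
        g += GAMMA ** n * q[EPISODE[t + n][0]]
    return g


def tree_backup_return(t, n, q):
    """n-step tree backup target, written recursively."""
    T = len(EPISODE)
    reward = EPISODE[t][1]
    if t + 1 == T:               # next state is terminal
        return reward
    if n == 1:                   # leaf: expectation over all actions
        return reward + GAMMA * sum(PI[a] * q[a] for a in PI)
    taken = EPISODE[t + 1][0]    # action actually taken at the next step
    others = sum(PI[a] * q[a] for a in PI if a != taken)
    deeper = tree_backup_return(t + 1, n - 1, q)
    return reward + GAMMA * others + GAMMA * PI[taken] * deeper


def first_updates(return_fn, n):
    """Apply the first update to q(X, b) (t=0) and then to q(X, c) (t=1)."""
    q = dict(Q0)
    for t in (0, 1):
        action = EPISODE[t][0]
        q[action] += ALPHA * (return_fn(t, n, q) - q[action])
    return q


if __name__ == "__main__":
    # The step counts 1, 2 and 3 are assumptions made for this sketch.
    for name, fn, n in [("SARSA", n_step_sarsa_return, 1),
                        ("SARSA", n_step_sarsa_return, 2),
                        ("Tree backup", tree_backup_return, 3)]:
        q = first_updates(fn, n)
        print(f"{n}-step {name}: q(X,b)={q['b']:.2f}, q(X,c)={q['c']:.2f}")
```

The updates are applied in time order, so the update for q(X, c) sees the already-updated q(X, b); whether the original problem intends that or expects both updates to use the initial values is an assumption worth checking against the course's convention.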
