Question: Q.4 Consider the grid environment with six states (numbered from 1 to 6) as shown in Figure 4.1, with the thick border indicating walls. Suppose that at each state the agent can move up (denoted by a), right (a), down (a), or left (a). When the agent moves from state s to state s' it receives a reward of if ss and otherwise. For example, the agent receives a reward of when moving from state to state. Assume that state and state are goal (i.e., terminal) states. Let the discount rate gamma be.

(a) Suppose that the agent, after taking an action at a state, enters the adjacent state in the direction of the intended action, or remains in the same state in the case of a wall collision. For example, at state if the agent takes action a it will remain in state; but if it takes action a it will enter state. Determine the values of the Q-function for the policy: pi a, pi a, pi a, and pi a. [marks]

(b) Suppose that the state transitions are now probabilistic, with the probabilities specified as follows. At a non-terminal state, after taking an action, the agent ends up in the adjacent state in the direction of the intended action (or remains in the same state in the case of a wall collision) with a probability of, and ends up in an adjacent state in the direction perpendicular to the intended direction (or remains in the same state in the case of a wall collision) with a probability of. For example, if the agent takes action a at state it will end up in state with a probability of, or end up at state or state with a probability of for each case. Given the initial values of the Q-function as shown in Table, apply the value iteration algorithm for one iteration that starts with Q a to calculate the values of Q a, Q a, Q a, and Q a, in the order in which they are listed. Note: You can omit the rest of the iteration once you have calculated these four values. [marks]

(c) Consider the state transition sequence a a a, where the number to the left of an arrow indicates the state, while the symbol above the arrow indicates the action taken at that state. Suppose that this sequence is executed in a trial using Q-learning with alpha k k and alpha. Assume that the current values of the Q-function are all zero. Determine the values of the Q-function at the completion of this state transition sequence. [marks]
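The two update rules that parts (b) and (c) rely on can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the grid layout, rewards, discount rate, transition probabilities, and step sizes of the question did not survive in the text above, so the state names ("A", "B", "T"), the action labels, and every number below are hypothetical placeholders, not the values of the original problem.

```python
from collections import defaultdict

# Hypothetical labels for the four moves; the question's own action names are lost.
ACTIONS = ("up", "right", "down", "left")


def best_future(Q, s_next, terminal):
    """Greedy value of the next state; terminal states contribute nothing."""
    if s_next in terminal:
        return 0.0
    return max(Q[(s_next, a)] for a in ACTIONS)


def q_backup(Q, outcomes, gamma, terminal):
    """One value-iteration backup for a single (state, action) pair.

    outcomes is a list of (probability, next_state, reward) triples that
    describe the stochastic result of taking the action.
    """
    return sum(p * (r + gamma * best_future(Q, s_next, terminal))
               for p, s_next, r in outcomes)


def q_learning_update(Q, s, a, r, s_next, alpha, gamma, terminal):
    """Q-learning update for one observed transition (s, a, r, s_next)."""
    target = r + gamma * best_future(Q, s_next, terminal)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


# Placeholder setup: two ordinary states "A", "B" and one terminal state "T".
Q = defaultdict(float)      # all Q-values start at zero, as in part (c)
terminal = {"T"}
gamma = 0.5                 # assumed discount rate
alpha = 0.5                 # assumed step size

# Two observed transitions with assumed rewards.
q_learning_update(Q, "A", "right", -1.0, "B", alpha, gamma, terminal)
q_learning_update(Q, "B", "right", 10.0, "T", alpha, gamma, terminal)

# One backup under an assumed 0.8 / 0.1 / 0.1 transition model.
outcomes = [(0.8, "B", -1.0), (0.1, "A", -1.0), (0.1, "A", -1.0)]
Q[("A", "right")] = q_backup(Q, outcomes, gamma, terminal)
```

If the backups in part (b) are applied in place, which the instruction to compute the four values in their listed order seems to suggest, each new value is written into the table before the next pair is backed up, so later backups see the earlier results.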
