
Q.4 Consider the grid environment with six states (numbered from 1 to 6) as shown in Figure 4.1, with the thick border indicating walls. Suppose that at each state the agent can move up (denoted by a_1), right (a_2), down (a_3), or left (a_4). When the agent moves from state s to state s', it receives a reward of 1 if s > s' and 0 otherwise. For example, the agent receives a reward of 1 when moving from state 4 to state 3. Assume that state 1 and state 5 are goal (i.e., terminal) states. Let the discount rate \gamma be 0.7.

(a) Suppose that, after taking an action at a state, the agent enters the adjacent state in the direction of the intended action, or remains in the same state in the case of a wall collision. For example, at state 2, if the agent takes action a_1, it will remain in state 2; but if it takes action a_3, it will enter state 3. Determine the values of the Q-function for the policy \pi(2) = a_3, \pi(3) = a_4, \pi(4) = a_1, and \pi(6) = a_1. (10 marks)

(b) Suppose that the state transitions are now probabilistic, with the probabilities specified as follows. At a non-terminal state, after taking an action the agent ends up in the adjacent state in the direction of the intended action (or remains in the same state in the case of a wall collision) with a probability of 0.8, and ends up in each adjacent state in a direction perpendicular to the intended direction (or remains in the same state in the case of a wall collision) with a probability of 0.1. For example, if the agent takes action a_1 at state 4, it will end up in state 1 with a probability of 0.8, or end up in state 3 or state 4 with a probability of 0.1 each. Given the initial values of the Q-function as shown in Table 4.1, apply the value iteration algorithm for one iteration, starting with Q(3, a_1), to calculate the values of Q(3, a_1), Q(3, a_2), Q(3, a_3), and Q(3, a_4), in the order listed.
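For part (a), the Q-values can be computed by one-step lookahead followed by policy evaluation. Figure 4.1 is not reproduced here, so the sketch below assumes a 2-column, 3-row layout inferred from the examples in the question (state 2 is in the top row with a wall above it, state 3 is below state 2, state 4 is left of state 3 and below state 1, state 3 is above state 6); treat the layout as an assumption, not part of the question.

```python
# Hedged sketch for part (a): deterministic policy evaluation.
# ASSUMED grid layout (Figure 4.1 is not available), inferred from
# the worked examples in the question text:
#
#   1 2
#   4 3
#   5 6
#
# Walls surround the border; states 1 and 5 are terminal.

GAMMA = 0.7
TERMINAL = {1, 5}

# (row, col) coordinates for each state under the assumed layout.
POS = {1: (0, 0), 2: (0, 1), 4: (1, 0), 3: (1, 1), 5: (2, 0), 6: (2, 1)}
STATE_AT = {rc: s for s, rc in POS.items()}

# a1 = up, a2 = right, a3 = down, a4 = left.
MOVES = {"a1": (-1, 0), "a2": (0, 1), "a3": (1, 0), "a4": (0, -1)}

def step(s, a):
    """Deterministic transition: move if no wall, else stay put."""
    r, c = POS[s]
    dr, dc = MOVES[a]
    return STATE_AT.get((r + dr, c + dc), s)

def reward(s, s2):
    """Reward 1 when moving to a lower-numbered state, else 0."""
    return 1 if s > s2 else 0

# The policy given in part (a).
PI = {2: "a3", 3: "a4", 4: "a1", 6: "a1"}

def v_pi(s, _depth=0):
    """V^pi(s): follow pi to a terminal state (the policy is acyclic
    under the assumed layout; _depth guards against a wrong layout)."""
    if s in TERMINAL or _depth > 10:
        return 0.0
    s2 = step(s, PI[s])
    return reward(s, s2) + GAMMA * v_pi(s2, _depth + 1)

def q_pi(s, a):
    """Q^pi(s, a): take action a once, then follow pi thereafter."""
    s2 = step(s, a)
    return reward(s, s2) + GAMMA * v_pi(s2)

for s in (2, 3, 4, 6):
    print(s, {a: round(q_pi(s, a), 4) for a in MOVES})
```

Under these assumptions the values along the policy come out as, e.g., Q(4, a_1) = 1 and Q(3, a_4) = 0 + 0.7 x 1 = 0.7; the printed table should be checked against the actual Figure 4.1.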
(Note: You can omit the rest of the iteration once you have calculated these four values.) (10 marks)

(c) Consider the state transition sequence 6 --a_1--> 3 --a_4--> 4 --a_1--> 1, where the number to the left of an arrow indicates the state and the symbol above the arrow indicates the action taken at that state. Suppose that this sequence is executed in a trial using Q-learning (with \alpha_k = 1/k and \alpha_0 = 1). Assume that the current values of the Q-function are all zero. Determine the values of the Q-function at the completion of this state transition sequence. (5 marks)
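Part (c) needs only the rewards along the trajectory (1 for 6 -> 3 and 4 -> 1, since the state number decreases; 0 for 3 -> 4), \gamma = 0.7, and the standard Q-learning update. A minimal sketch, assuming \alpha_k = 1/k counts visits per state-action pair (so each pair here is updated once with \alpha = 1):

```python
# Hedged sketch for part (c): Q-learning along the trajectory
# 6 --a1--> 3 --a4--> 4 --a1--> 1, all Q-values initially zero.

GAMMA = 0.7
TERMINAL = {1, 5}
ACTIONS = ["a1", "a2", "a3", "a4"]

Q = {(s, a): 0.0 for s in range(1, 7) for a in ACTIONS}
visits = {}  # per state-action visit counts, for alpha_k = 1/k

def reward(s, s2):
    # Reward 1 when moving to a lower-numbered state, else 0.
    return 1 if s > s2 else 0

trajectory = [(6, "a1", 3), (3, "a4", 4), (4, "a1", 1)]

for s, a, s2 in trajectory:
    k = visits[(s, a)] = visits.get((s, a), 0) + 1
    alpha = 1.0 / k  # first visit: alpha = 1
    # Bootstrap target is zero at a terminal state.
    target = 0.0 if s2 in TERMINAL else max(Q[(s2, a2)] for a2 in ACTIONS)
    Q[(s, a)] += alpha * (reward(s, s2) + GAMMA * target - Q[(s, a)])

print(Q[(6, "a1")], Q[(3, "a4")], Q[(4, "a1")])  # prints: 1.0 0.0 1.0
```

Because every bootstrap target is still zero during this first trial, only the two rewarded transitions leave a mark: Q(6, a_1) = 1, Q(4, a_1) = 1, and every other entry (including Q(3, a_4)) remains 0.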
