You are given an environment with one state, X, and two actions, b and c. T is the terminal state. Your Temporal Difference (TD) algorithm generates the following episode using the policy π when interacting with its environment:

| Timestep | Reward | State | Action |
|----------|--------|-------|--------|
| 0        |        | X     | b      |
| 1        | 16     | X     | c      |
| 2        | 12     | X     | b      |
| 3        | 16     | T     |        |

The policy is given by: π(b | X) = 0.9, π(c | X) = 0.1.
The current values of q are: q(X, b) = 1 and q(X, c) = 2.
The discount factor, γ = 1/2.
The step size, α = 0.1.

Show the values of q(X, b) and q(X, c) after their first update using the following approaches:

(a) 1-step SARSA (2 + 2)
(b) 2-step SARSA (2 + 2)
(c) 2-step Full Tree Backup (3 + 3)

Note: You should update q(X, b) and q(X, c) only once per learning algorithm. Show your work and carry out your calculations to two decimal places.
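One way to check the arithmetic is a short Python sketch. It is not the expert solution from this page, only an illustration under several assumptions: the rewards are read as 16, 12 and 16; the discount factor is read as 1/2; the value of the terminal state T is 0; the returns follow the standard textbook definitions of n-step SARSA and n-step tree backup; and the two updates are applied online in episode order, so the 1-step SARSA update of q(X, c) bootstraps from the already updated q(X, b). The names `sarsa_return`, `tree_backup_return` and `first_updates` are hypothetical helpers introduced for this example.

```python
GAMMA = 0.5                      # discount factor (assumed to be 1/2)
ALPHA = 0.1                      # step size
PI = {"b": 0.9, "c": 0.1}        # policy in state X

# Episode: S_0..S_3, A_0..A_2, and REWARDS[t] = R_t (R_0 does not exist).
STATES = ["X", "X", "X", "T"]
ACTS = ["b", "c", "b"]
REWARDS = [None, 16.0, 12.0, 16.0]   # assumed reading of the reward column
LAST = len(STATES) - 1               # time index of the terminal state


def expected_q(q, state, exclude=None):
    """Policy-weighted sum of q(state, a), optionally skipping one action."""
    if state == "T":
        return 0.0
    return sum(PI[a] * q[(state, a)] for a in PI if a != exclude)


def sarsa_return(q, t, n):
    """n-step SARSA return G_{t:t+n}, truncated at the terminal state."""
    end = min(t + n, LAST)
    g = sum(GAMMA ** (k - t - 1) * REWARDS[k] for k in range(t + 1, end + 1))
    if end < LAST:                   # bootstrap only from a non-terminal state
        g += GAMMA ** n * q[(STATES[end], ACTS[end])]
    return g


def tree_backup_return(q, t, n):
    """n-step tree backup return G_{t:t+n}, defined recursively."""
    reward, s_next = REWARDS[t + 1], STATES[t + 1]
    if s_next == "T":
        return reward
    if n == 1:                       # leaf: expectation over all actions
        return reward + GAMMA * expected_q(q, s_next)
    a_next = ACTS[t + 1]
    return (reward
            + GAMMA * expected_q(q, s_next, exclude=a_next)
            + GAMMA * PI[a_next] * tree_backup_return(q, t + 1, n - 1))


def first_updates(return_fn, n):
    """Apply the first update of (X, b) at t = 0 and of (X, c) at t = 1."""
    q = {("X", "b"): 1.0, ("X", "c"): 2.0}   # fresh copy per algorithm
    for t in (0, 1):
        key = (STATES[t], ACTS[t])
        q[key] += ALPHA * (return_fn(q, t, n) - q[key])
    return q


for label, fn, n in [("1-step SARSA", sarsa_return, 1),
                     ("2-step SARSA", sarsa_return, 2),
                     ("2-step tree backup", tree_backup_return, 2)]:
    q = first_updates(fn, n)
    print(f"{label}: q(X,b) = {q[('X', 'b')]:.4f}, q(X,c) = {q[('X', 'c')]:.4f}")
```

Under these assumptions the sketch gives roughly q(X, b) = 2.60 and q(X, c) = 3.13 for 1-step SARSA, 3.125 and 3.80 for 2-step SARSA, and 2.61 and 3.73 for 2-step tree backup. If the 1-step SARSA update of q(X, c) is instead computed from the original q(X, b) = 1, it comes out as 3.05; the other two methods are unaffected by that choice because neither of their q(X, c) targets bootstraps from q(X, b).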


