
Q.4 Consider the grid environment with six states (numbered from 1 to 6) as shown in Figure 4.1, with the thick border indicating walls. Suppose that at each state the agent can move up (denoted by a_1), right (a_2), down (a_3), or left (a_4). When the agent moves from state s to state s', it receives a reward of 1 if s > s' and 0 otherwise. For example, the agent receives a reward of 1 when moving from state 4 to state 3. Assume that state 1 and state 5 are goal (i.e., terminal) states. Let the discount rate \gamma be 0.7.

(a) Suppose that, after taking an action at a state, the agent enters the adjacent state in the direction of the intended action, or remains in the same state in the case of a wall collision. For example, at state 2, if the agent takes action a_1, it will remain in state 2; but if it takes action a_3, it will enter state 3. Determine the values of the Q-function for the policy \pi(2) = a_3, \pi(3) = a_4, \pi(4) = a_1, and \pi(6) = a_1. (10 marks)

(b) Suppose that the state transitions are now probabilistic, with the probabilities specified as follows. At a non-terminal state, after taking an action the agent ends up in the adjacent state in the direction of the intended action (or remains in the same state in the case of a wall collision) with a probability of 0.8, and ends up in each adjacent state in a direction perpendicular to the intended direction (or remains in the same state in the case of a wall collision) with a probability of 0.1. For example, if the agent takes action a_1 at state 4, it will end up in state 1 with a probability of 0.8, or end up in state 3 or state 4 with a probability of 0.1 each. Given the initial values of the Q-function as shown in Table 4.1, apply the value iteration algorithm for one iteration, starting with Q(3, a_1), to calculate the values of Q(3, a_1), Q(3, a_2), Q(3, a_3), and Q(3, a_4), in the order listed.
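For part (a), the Q-values can be computed by one-step lookahead followed by policy evaluation. Figure 4.1 is not reproduced here, so the sketch below assumes a 2-column, 3-row layout inferred from the examples in the question (state 2 is in the top row with a wall above it, state 3 is below state 2, state 4 is left of state 3 and below state 1, state 3 is above state 6); treat the layout as an assumption, not part of the question.

```python
# Hedged sketch for part (a): deterministic policy evaluation.
# ASSUMED grid layout (Figure 4.1 is not available), inferred from
# the worked examples in the question text:
#
#   1 2
#   4 3
#   5 6
#
# Walls surround the border; states 1 and 5 are terminal.

GAMMA = 0.7
TERMINAL = {1, 5}

# (row, col) coordinates for each state under the assumed layout.
POS = {1: (0, 0), 2: (0, 1), 4: (1, 0), 3: (1, 1), 5: (2, 0), 6: (2, 1)}
STATE_AT = {rc: s for s, rc in POS.items()}

# a1 = up, a2 = right, a3 = down, a4 = left.
MOVES = {"a1": (-1, 0), "a2": (0, 1), "a3": (1, 0), "a4": (0, -1)}

def step(s, a):
    """Deterministic transition: move if no wall, else stay put."""
    r, c = POS[s]
    dr, dc = MOVES[a]
    return STATE_AT.get((r + dr, c + dc), s)

def reward(s, s2):
    """Reward 1 when moving to a lower-numbered state, else 0."""
    return 1 if s > s2 else 0

# The policy given in part (a).
PI = {2: "a3", 3: "a4", 4: "a1", 6: "a1"}

def v_pi(s, _depth=0):
    """V^pi(s): follow pi to a terminal state (the policy is acyclic
    under the assumed layout; _depth guards against a wrong layout)."""
    if s in TERMINAL or _depth > 10:
        return 0.0
    s2 = step(s, PI[s])
    return reward(s, s2) + GAMMA * v_pi(s2, _depth + 1)

def q_pi(s, a):
    """Q^pi(s, a): take action a once, then follow pi thereafter."""
    s2 = step(s, a)
    return reward(s, s2) + GAMMA * v_pi(s2)

for s in (2, 3, 4, 6):
    print(s, {a: round(q_pi(s, a), 4) for a in MOVES})
```

Under these assumptions the values along the policy come out as, e.g., Q(4, a_1) = 1 and Q(3, a_4) = 0 + 0.7 x 1 = 0.7; the printed table should be checked against the actual Figure 4.1.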
(Note: You can omit the rest of the iteration once you have calculated these four values.) (10 marks)

(c) Consider the state transition sequence 6 --a_1--> 3 --a_4--> 4 --a_1--> 1, where the number to the left of an arrow indicates the state and the symbol above the arrow indicates the action taken at that state. Suppose that this sequence is executed in a trial using Q-learning (with \alpha_k = 1/k and \alpha_0 = 1). Assume that the current values of the Q-function are all zero. Determine the values of the Q-function at the completion of this state transition sequence. (5 marks)
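Part (c) needs only the rewards along the trajectory (1 for 6 -> 3 and 4 -> 1, since the state number decreases; 0 for 3 -> 4), \gamma = 0.7, and the standard Q-learning update. A minimal sketch, assuming \alpha_k = 1/k counts visits per state-action pair (so each pair here is updated once with \alpha = 1):

```python
# Hedged sketch for part (c): Q-learning along the trajectory
# 6 --a1--> 3 --a4--> 4 --a1--> 1, all Q-values initially zero.

GAMMA = 0.7
TERMINAL = {1, 5}
ACTIONS = ["a1", "a2", "a3", "a4"]

Q = {(s, a): 0.0 for s in range(1, 7) for a in ACTIONS}
visits = {}  # per state-action visit counts, for alpha_k = 1/k

def reward(s, s2):
    # Reward 1 when moving to a lower-numbered state, else 0.
    return 1 if s > s2 else 0

trajectory = [(6, "a1", 3), (3, "a4", 4), (4, "a1", 1)]

for s, a, s2 in trajectory:
    k = visits[(s, a)] = visits.get((s, a), 0) + 1
    alpha = 1.0 / k  # first visit: alpha = 1
    # Bootstrap target is zero at a terminal state.
    target = 0.0 if s2 in TERMINAL else max(Q[(s2, a2)] for a2 in ACTIONS)
    Q[(s, a)] += alpha * (reward(s, s2) + GAMMA * target - Q[(s, a)])

print(Q[(6, "a1")], Q[(3, "a4")], Q[(4, "a1")])  # prints: 1.0 0.0 1.0
```

Because every bootstrap target is still zero during this first trial, only the two rewarded transitions leave a mark: Q(6, a_1) = 1, Q(4, a_1) = 1, and every other entry (including Q(3, a_4)) remains 0.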
