Question 1: Q-Learning [35 Points]

This time, although the Gridworld looks similar, it is not an MDP anymore. That means the only information you get from the game object is game.get_actions(state: State), which returns a set of Actions. All other methods have been removed.
You will also see that the GUI adds a new episode label. Each iteration is a single update, and each episode is a collection of iterations such that the agent starts from the starting state at the beginning of an episode and reaches a terminal state at the end of an episode.
A stub of a Q-learner is specified in QLearningAgent. A QLearningAgent takes in:
- game, an object used to get the set of available actions for a state s
- discount, the discount factor gamma
- learning_rate, the learning rate alpha
- explore_prob, the probability of exploration for each iteration
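For reference, here is a minimal sketch of what the constructor might store, assuming a dictionary-based Q-table; the attribute names below are illustrative choices, not dictated by the skeleton:

    from collections import defaultdict

    def __init__(self, game, discount, learning_rate, explore_prob):
        self.game = game                    # provides get_actions(state)
        self.discount = discount            # gamma
        self.learning_rate = learning_rate  # alpha
        self.explore_prob = explore_prob    # epsilon
        # defaultdict(float) makes never-seen (state, action) pairs default to 0.0
        self.q_table = defaultdict(float)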
Note that the states in Gridworld aren't given in advance either. Your code should be able to deal with a new state and a new state-action pair.
Then you need to implement the following methods for this part:
- agent.get_q_value(state, action) returns the Q-value Q(s, a) for the state-action pair, looked up from the Q-table. For a never-seen pair, the Q-value should be 0.
- agent.get_value(state) returns the value of the state, V(s).
- agent.get_best_policy(state) returns the best policy (action) for the state, π(s).
- agent.update(state, action, next_state, reward) updates the Q-value for the state-action pair, based on the value of next_state and the reward given.
Note: For get_best_policy, you should break ties randomly for better behavior. The random.choice function will help.
For more instructions, refer to the lecture slides and comments in the skeleton file.
Important: Make sure that in your get_value and get_best_policy functions, you only access Q-values by calling get_q_value. This abstraction will be useful for the Approximate Q-learning you will implement later, when you override get_q_value to use features of state-action pairs rather than the state-action pairs directly.
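For illustration only, here is one way get_value and get_best_policy could be written so that they only touch Q-values through get_q_value, with tie-breaking as described in the note above. This is a sketch rather than the required solution; self.game is assumed to be the stored game object.

    import random

    def get_value(self, state):
        # V(s) = max_a Q(s, a); assumption: a state with no available actions is worth 0
        actions = self.game.get_actions(state)
        if not actions:
            return 0.0
        return max(self.get_q_value(state, a) for a in actions)

    def get_best_policy(self, state):
        # pi(s) = argmax_a Q(s, a), breaking ties uniformly at random
        actions = list(self.game.get_actions(state))
        best = max(self.get_q_value(state, a) for a in actions)
        tied = [a for a in actions if self.get_q_value(state, a) == best]
        return random.choice(tied)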
With the Q-learning update in place, you can watch your Q-learner learn under manual control, using the keyboard arrow keys:
python gridworld.py
# template
import random


class QLearningAgent:
    """Implement Q Reinforcement Learning Agent using Q-table."""

    def __init__(self, game, discount, learning_rate, explore_prob):
        """Store any needed parameters into the agent object.
        Initialize Q-table."""
        # TODO

    def get_q_value(self, state, action):
        """Retrieve Q-value from Q-table.
        For a never-seen (s, a) pair, the Q-value is 0 by default."""
        return 0  # TODO

    def get_value(self, state):
        """Compute state value from Q-values using Bellman Equation.
        V(s) = max_a Q(s, a)"""
        return 0  # TODO

    def get_best_policy(self, state):
        """Compute the best action to take in the state using Policy Extraction.
        pi(s) = argmax_a Q(s, a)
        If there are ties, return a random one for better performance.
        Hint: use random.choice()."""
        return None  # TODO

    def update(self, state, action, next_state, reward):
        """Update Q-values using running average.
        Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (R + gamma * V(s'))
        where alpha is the learning rate and gamma is the discount.
        Note: You should not call this function in your code."""
        # TODO

    # Epsilon Greedy
    def get_action(self, state):
        """Compute the action to take for the agent, incorporating exploration.
        That is, with probability epsilon, act randomly.
        Otherwise, act according to the best policy.
        Hint: use random.random() < epsilon to check if exploration is needed."""
        return None  # TODO
# Bridge Crossing Revisited
def question():
    epsilon = ...
    learning_rate = ...
    return epsilon, learning_rate
    # If not possible, return 'NOT POSSIBLE'
# Approximate Q-Learning
class ApproximateQAgent(QLearningAgent):
    """Implement Approximate Q Learning Agent using weights."""

    def __init__(self, *args, extractor):
        """Initialize parameters and store the feature extractor.
        Initialize weights table."""
        super().__init__(*args)
        # TODO

    def get_weight(self, feature):
        """Get weight of a feature.
        A never-seen feature should have a weight of 0."""
        return 0  # TODO

    def get_q_value(self, state, action):
        """Compute Q-value based on the dot product of feature components and weights.
        Q(s, a) = w_1 * f_1(s, a) + w_2 * f_2(s, a) + ... + w_n * f_n(s, a)"""
        return 0  # TODO

    def update(self, state, action, next_state, reward):
        """Update weights using least-squares approximation.
        Delta = R + gamma * V(s') - Q(s, a)
        Then update weights: w_i = w_i + alpha * Delta * f_i(s, a)"""
        # TODO
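To make the two update rules above concrete, here is a hedged sketch of the arithmetic as standalone helper functions. The function names, the dict-based Q-table and weights, and the example states are all invented for illustration; they are not part of the provided skeleton.

    import random

    def tabular_q_update(q_table, state, action, next_state, reward, alpha, gamma, get_value):
        # Running average: Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (R + gamma * V(s'))
        sample = reward + gamma * get_value(next_state)
        old_q = q_table.get((state, action), 0.0)
        q_table[(state, action)] = (1 - alpha) * old_q + alpha * sample

    def epsilon_greedy(actions, best_action, epsilon):
        # With probability epsilon act randomly, otherwise follow the best policy
        if random.random() < epsilon:
            return random.choice(list(actions))
        return best_action

    def approximate_q_update(weights, features, reward, next_value, q_value, alpha, gamma):
        # Delta = R + gamma * V(s') - Q(s, a); then w_i <- w_i + alpha * Delta * f_i(s, a)
        delta = reward + gamma * next_value - q_value
        for name, value in features.items():
            weights[name] = weights.get(name, 0.0) + alpha * delta * value

    # Tiny example with made-up states; with a fresh table, V('B') is taken as 0
    q = {}
    tabular_q_update(q, 'A', 'right', 'B', reward=1.0, alpha=0.5, gamma=0.9, get_value=lambda s: 0.0)
    print(q)  # {('A', 'right'): 0.5}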
