1. Q-Learning [35 Points]
This time, although the Gridworld looks similar, it is not an MDP anymore. That means the only information you get from the game object is game.get_actions(state: State), which returns a set of Actions. All other methods have been removed.
You will also see that the GUI adds a new episode label. Each iteration is a single update, and each episode is a collection of iterations: the agent starts from the starting state at the beginning of an episode and reaches a terminal state at the end of the episode.
A stub of a Q-learner is specified in QLearningAgent. A QLearningAgent takes in
game, an object to get the set of available actions for a state s.
discount, the discount factor \gamma.
learning_rate, the learning rate \alpha.
explore_prob, the probability of exploration for each iteration.
Note that the states in Gridworld aren't given either. Your code should be able to deal with a new state and a new (state, action) pair.
Then you need to implement the following methods for this part:
agent.get_q_value(state, action) returns the Q-value Q(s, a) for the (state, action) pair from the Q-table. For a never-seen pair, the Q-value should be 0.
agent.get_value(state) returns the value of the state, V(s).
agent.get_best_policy(state) returns the best policy of the state, \pi(s).
agent.update(state, action, next_state, reward) updates the Q-value for (state, action) pair, based on the value of the next_state and reward given.
Note: For get_best_policy, you should break ties randomly for better behavior. The random.choice() function will help.
For more instructions, refer to the lecture slides and comments in the skeleton file.
Important: Make sure that in your get_value and get_best_policy functions, you only access Q-values by calling get_q_value. This abstraction will be useful for the Approximate Q-learning you will implement later when you override get_q_value to use features of state-action pairs rather than state-action pairs directly.
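To make the expected behavior concrete, here is a minimal sketch of these four methods. It assumes a plain dict Q-table keyed by (state, action) and attribute names such as q_table that are not part of the provided skeleton; it is an illustration of the idea, not the official solution.
# sketch (assumed attribute names; not the provided skeleton)
import random

class SketchQLearner:
    def __init__(self, game, discount, learning_rate, explore_prob):
        self.game = game
        self.discount = discount
        self.learning_rate = learning_rate
        self.explore_prob = explore_prob
        self.q_table = {}  # (state, action) -> Q-value; unseen pairs default to 0

    def get_q_value(self, state, action):
        return self.q_table.get((state, action), 0.0)

    def get_value(self, state):
        # V(s) = max_a Q(s, a); Q-values are read only through get_q_value
        actions = self.game.get_actions(state)
        if not actions:
            return 0.0
        return max(self.get_q_value(state, a) for a in actions)

    def get_best_policy(self, state):
        # pi(s) = argmax_a Q(s, a), breaking ties randomly
        actions = list(self.game.get_actions(state))
        if not actions:
            return None
        best = self.get_value(state)
        return random.choice([a for a in actions if self.get_q_value(state, a) == best])

    def update(self, state, action, next_state, reward):
        # Q(s, a) <- (1 - alpha) Q(s, a) + alpha (R + gamma V(s'))
        sample = reward + self.discount * self.get_value(next_state)
        old = self.get_q_value(state, action)
        self.q_table[(state, action)] = (1 - self.learning_rate) * old + self.learning_rate * sample
Because get_value and get_best_policy only go through get_q_value, the same two methods keep working unchanged once get_q_value is overridden in the approximate agent.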
With the Q-learning update in place, you can watch your Q-learner learn under manual control, using the keyboard arrow keys:
python gridworld.py
# template
import random


class QLearningAgent:
    """Implement a Q-learning reinforcement learning agent using a Q-table."""

    def __init__(self, game, discount, learning_rate, explore_prob):
        """Store any needed parameters in the agent object.
        Initialize the Q-table.
        """
        ...  # TODO

    def get_q_value(self, state, action):
        """Retrieve a Q-value from the Q-table.
        For a never-seen (s, a) pair, the Q-value is 0 by default.
        """
        return 0  # TODO

    def get_value(self, state):
        """Compute the state value from Q-values using the Bellman equation.
        V(s) = max_a Q(s, a)
        """
        return 0  # TODO

    def get_best_policy(self, state):
        """Compute the best action to take in the state using policy extraction.
        \pi(s) = argmax_a Q(s, a)
        If there are ties, return a random one for better performance.
        Hint: use random.choice().
        """
        return None  # TODO

    def update(self, state, action, next_state, reward):
        """Update Q-values using a running average.
        Q(s, a) = (1 - \alpha) Q(s, a) + \alpha (R + \gamma V(s'))
        where \alpha is the learning rate and \gamma is the discount.
        Note: You should not call this function in your code.
        """
        ...  # TODO

    # 2. Epsilon Greedy
    def get_action(self, state):
        """Compute the action to take for the agent, incorporating exploration.
        That is, with probability \epsilon, act randomly.
        Otherwise, act according to the best policy.
        Hint: use random.random() < \epsilon to check if exploration is needed.
        """
        return None  # TODO


# 3. Bridge Crossing Revisited
def question3():
    # If not possible, return 'NOT POSSIBLE'
    epsilon = ...
    learning_rate = ...
    return epsilon, learning_rate


# 5. Approximate Q-Learning
class ApproximateQAgent(QLearningAgent):
    """Implement an approximate Q-learning agent using weights."""

    def __init__(self, *args, extractor):
        """Initialize parameters and store the feature extractor.
        Initialize the weights table.
        """
        super().__init__(*args)
        ...  # TODO

    def get_weight(self, feature):
        """Get the weight of a feature.
        A never-seen feature should have a weight of 0.
        """
        return 0  # TODO

    def get_q_value(self, state, action):
        """Compute the Q-value from the dot product of feature values and weights.
        Q(s, a) = w_1 * f_1(s, a) + w_2 * f_2(s, a) + ... + w_n * f_n(s, a)
        """
        return 0  # TODO

    def update(self, state, action, next_state, reward):
        """Update weights using least-squares approximation.
        \Delta = R + \gamma V(s') - Q(s, a)
        Then update the weights: w_i = w_i + \alpha * \Delta * f_i(s, a)
        """
        ...  # TODO
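The remaining pieces (epsilon-greedy action selection and the approximate agent) can be sketched the same way. The class below extends SketchQLearner from the earlier sketch; in the actual skeleton get_action belongs to QLearningAgent, and the assumption that extractor(state, action) returns a dict mapping features to values is only a guess at the extractor interface, not something the assignment states.
# sketch (assumed names and extractor interface; not the provided skeleton)
import random

class SketchApproximateQLearner(SketchQLearner):
    def __init__(self, game, discount, learning_rate, explore_prob, extractor):
        super().__init__(game, discount, learning_rate, explore_prob)
        self.extractor = extractor  # assumed: extractor(state, action) -> {feature: value}
        self.weights = {}           # feature -> weight; unseen features default to 0

    def get_action(self, state):
        # epsilon-greedy: with probability explore_prob act randomly, else follow the policy
        actions = list(self.game.get_actions(state))
        if not actions:
            return None
        if random.random() < self.explore_prob:
            return random.choice(actions)
        return self.get_best_policy(state)

    def get_weight(self, feature):
        return self.weights.get(feature, 0.0)

    def get_q_value(self, state, action):
        # Q(s, a) = sum_i w_i * f_i(s, a); get_value and get_best_policy still work
        # because they only read Q-values through get_q_value
        return sum(self.get_weight(f) * v for f, v in self.extractor(state, action).items())

    def update(self, state, action, next_state, reward):
        # delta = R + gamma V(s') - Q(s, a), then w_i <- w_i + alpha * delta * f_i(s, a)
        delta = reward + self.discount * self.get_value(next_state) - self.get_q_value(state, action)
        for feature, value in self.extractor(state, action).items():
            self.weights[feature] = self.get_weight(feature) + self.learning_rate * delta * value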
