Question: The goal of this assignment is to implement the Q-learning method on the Taxi-v3 environment in the OpenAI Gym framework.

The goal of this assignment is to implement the Q-learning method on the Taxi-v3 environment in the OpenAI Gym framework.
Your task in this environment is to pick up the passenger at one location and drop him off at another. Both are chosen from four possible locations, each labeled with a different letter. In the example given below, you are expected to pick him up at Y and drop him off at G. You receive +20 points for a successful drop-off and lose 1 point for every timestep it takes. There is also a 10-point penalty for illegal pick-up and drop-off actions.
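To make the reward scheme concrete, here is a minimal sketch that tallies the return of one episode. The helper `episode_return` and its parameters are hypothetical names introduced for illustration. It assumes that the 1-point step cost applies only to timesteps that trigger no other reward, which is how the Gymnasium Taxi documentation describes it.

```python
def episode_return(plain_steps, illegal_actions, delivered):
    """Sum the rewards of one Taxi episode (hypothetical helper).

    plain_steps: timesteps that earned only the -1 step cost
    illegal_actions: illegal pick-up or drop-off attempts (-10 each)
    delivered: True if the episode ended with a successful drop-off (+20)
    """
    STEP_COST = -1
    ILLEGAL_PENALTY = -10
    DROPOFF_REWARD = 20
    total = plain_steps * STEP_COST + illegal_actions * ILLEGAL_PENALTY
    if delivered:
        total += DROPOFF_REWARD
    return total


# 12 ordinary moves, one illegal pick-up attempt, then a successful drop-off
print(episode_return(12, 1, True))  # -12 - 10 + 20 = -2
```

Under these assumptions an episode only ends with a positive return if the passenger is delivered in fewer than 20 ordinary steps with no illegal actions.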
Note that the dynamics of the model are assumed to be unknown.
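Because the transition model is unknown, the agent can only learn from sampled transitions (s, a, r, s'). One standard model-free rule for this setting is the tabular Q-learning update, Q(s, a) <- Q(s, a) + alpha * (r + gamma * max over a' of Q(s', a') - Q(s, a)). The sketch below applies that rule to a single transition. The function `q_update`, its `terminal` flag, and the tiny two-state table are illustrative assumptions, not part of the assignment code.

```python
def q_update(Q, s, a, r, s_next, alpha, gamma, terminal=False):
    """Apply one tabular Q-learning update in place and return the new Q[s][a]."""
    # A terminal next state has no future value to bootstrap from.
    best_next = 0.0 if terminal else max(Q[s_next])
    td_target = r + gamma * best_next
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q[s][a]


# Toy table: 2 states x 2 actions (the real Taxi table is much larger).
Q = [[0.0, 0.0], [1.0, 5.0]]
new_value = q_update(Q, s=0, a=1, r=-1, s_next=1, alpha=0.8, gamma=0.9)
print(round(new_value, 3))  # 0.8 * (-1 + 0.9 * 5) = 2.8
```

The update produces fractional values, so the table that stores them needs a floating-point type.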
Below is the original code; implement the Q-learning method accordingly.
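import gymnasium as gym
import time
import numpy as np
import os
import random


def qLearning(env):
    nS = env.observation_space.n
    nA = env.action_space.n
    # Float table: an int32 table would truncate the fractional TD updates.
    Q = np.zeros([nS, nA], dtype=np.float64)
    alpha = 0.8    # learning rate
    gamma = 0.9    # discount factor
    epsilon = 1    # exploration probability (1 = always act randomly)
    num_iter = 10000
    for i in range(num_iter):
        s, info = env.reset()
        for step in range(100):
            # Epsilon-greedy behaviour policy; with epsilon = 1 it is purely random.
            if random.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = int(np.argmax(Q[s]))
            sp, reward, done, truncated, info = env.step(action)
            # Off-policy TD target: bootstrap from the best action in the next state.
            Q[s, action] = Q[s, action] + alpha * (reward + gamma * np.max(Q[sp, :]) - Q[s, action])
            s = sp
            if done or truncated:
                break
        if i % 1000 == 0:
            print(f"Episode {i}")
    return Q


def SARSA(env):
    nS = env.observation_space.n
    nA = env.action_space.n
    Q = np.zeros([nS, nA], dtype=np.float64)
    alpha = 0.8
    gamma = 0.9
    epsilon = 1    # defined but unused: every action after the first is greedy
    num_iter = 1000
    for i in range(num_iter):
        s, info = env.reset()
        a = env.action_space.sample()
        for step in range(100):
            sp, reward, done, truncated, info = env.step(a)
            ap = int(np.argmax(Q[sp]))
            # On-policy TD target: bootstrap from the action actually taken next.
            Q[s, a] = Q[s, a] + alpha * (reward + gamma * Q[sp, ap] - Q[s, a])
            s = sp
            a = ap
            if done or truncated:
                break
        if i % 1000 == 0:
            print(f"Episode {i}")
    return Q


# Train without rendering; render_mode="human" would draw every training step.
env = gym.make('Taxi-v3')
Q = SARSA(env)  # call qLearning(env) here instead to evaluate the Q-learning table
env.close()

# Replay one greedy episode in a rendered environment.
env = gym.make('Taxi-v3', render_mode="human")
observation, info = env.reset()
done = False
sumreward = 0
while not done:
    os.system('cls' if os.name == 'nt' else 'clear')
    env.render()
    action = int(np.argmax(Q[observation]))
    observation, reward, terminated, truncated, info = env.step(action)
    # Stop on truncation as well, so a poor policy cannot loop forever.
    done = terminated or truncated
    sumreward += reward
    time.sleep(0.5)
print('done with reward:', sumreward)
env.close()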
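The code above sets epsilon to 1 and never lowers it, so the Q-learning agent explores at random for the whole of training. A common refinement is to act epsilon-greedily and shrink epsilon between episodes. The following is a minimal sketch of that idea, assuming a multiplicative decay. The names `epsilon_greedy` and `decayed_epsilon`, the decay rate, and the floor are illustrative choices, not values given in the assignment.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    # Break ties at random so an all-zero row does not always yield action 0.
    return rng.choice([a for a, q in enumerate(q_row) if q == best])


def decayed_epsilon(episode, start=1.0, floor=0.05, rate=0.999):
    """Multiplicative per-episode decay, clipped at a floor (assumed schedule)."""
    return max(floor, start * rate ** episode)


rng = random.Random(0)
q_row = [0.0, -1.0, 2.5, 2.5, -10.0, 0.0]  # one Q-table row for 6 actions
action = epsilon_greedy(q_row, epsilon=0.0, rng=rng)
print(action in (2, 3))  # True: epsilon = 0 always picks a best action
print(decayed_epsilon(0), decayed_epsilon(10000))  # 1.0 0.05
```

Inside the training loop, a call such as `epsilon_greedy(Q[s], decayed_epsilon(i))` could stand in for the purely random action choice.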
