Question: Is my thought process right for this MDP (Markov Decision Process)?

I need help figuring out if my thought process is right for this MDP (Markov Decision Process).

ACTION_EAST = 0
ACTION_SOUTH = 1
ACTION_WEST = 2
ACTION_NORTH = 3

TRANSITION_SUCCEED = 0.8  # probability that taking action A moves to the intended destination state S' (the state the action aims for)
TRANSITION_FAIL = 0.2     # probability that taking action A moves to an unintended neighboring state instead; e.g. taking East may move you North or South, each with probability 0.1 (the two directions split TRANSITION_FAIL evenly)
GAMMA = 0.9               # discount factor
ACTION_REWARD = -0.1      # instantaneous reward for taking each action (all four actions N/E/W/S have the same reward)
CONVERGENCE = 0.0000001   # threshold for deciding that the values have converged and iteration can stop
cur_convergence = 100
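To make the transition model concrete, here is a tiny illustration of the slip probabilities implied by these constants (an illustration I am adding, not part of the assignment code): the intended direction is reached with probability TRANSITION_SUCCEED, and each of the two perpendicular directions gets TRANSITION_FAIL / 2.

# Illustration only (assumed helper, not part of the assignment):
# the slip model implied by the constants above.
def transition_distribution(action):
    # E, S, W, N are in compass order, so the two perpendicular directions
    # are simply the previous and next action indices modulo 4.
    perpendicular = [(action + 1) % 4, (action - 1) % 4]
    return {action: TRANSITION_SUCCEED,
            perpendicular[0]: TRANSITION_FAIL / 2,
            perpendicular[1]: TRANSITION_FAIL / 2}

print(transition_distribution(ACTION_EAST))   # {0: 0.8, 1: 0.1, 3: 0.1}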

Also, here is the Cell class that is implemented:

class Cell:
    def __init__(self, x, y):
        self.q_values = [0.0, 0.0, 0.0, 0.0]
        self.location = (x, y)
        self.state_value = max(self.q_values)
        self.policy = 0
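For reference, here is how those fields usually relate once the Q values are filled in (a small usage sketch I am adding, assuming the standard convention that the state value is the best Q value and the policy is that action's index):

# Hypothetical usage of the Cell fields after an update.
c = Cell(0, 0)
c.q_values = [0.2, -0.1, 0.05, 0.0]          # example Q values, one per action (E, S, W, N)
c.state_value = max(c.q_values)              # V(s) = max over actions of Q(s, a)
c.policy = c.q_values.index(c.state_value)   # greedy policy = argmax over actions of Q(s, a)
print(c.state_value, c.policy)               # 0.2 0, i.e. go East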

##### Implement the below functions ############################

def computeQValue(s, action):
    print('Compute Q Values')
    # s is the state (cell); action is a value 0-3: 0-east, 1-south, 2-west, 3-north.
    # For each cell, the Q value is calculated based on the action taken,
    # and the cell's state data is updated with the Q value.
    # Do I need this: state_old_value = s.state_value.copy() ?
    if action == ACTION_EAST:
        s.q_values[0] = ACTION_REWARD + GAMMA * (s.state_value + ACTION_EAST)
        s.state_value = ACTION_REWARD + GAMMA * (TRANSITION_SUCCEED * s.q_values[0] + TRANSITION_FAIL * s.q_values[1] + TRANSITION_FAIL * s.q_values[3])
    elif action == ACTION_SOUTH:
        s.q_values[1] = ACTION_REWARD + GAMMA * (s.state_value + ACTION_SOUTH)
        s.state_value = ACTION_REWARD + GAMMA * (TRANSITION_SUCCEED * s.q_values[1] + TRANSITION_FAIL * s.q_values[0] + TRANSITION_FAIL * s.q_values[2])
    elif action == ACTION_WEST:
        s.q_values[2] = ACTION_REWARD + GAMMA * (s.state_value + ACTION_WEST)
        s.state_value = ACTION_REWARD + GAMMA * (TRANSITION_SUCCEED * s.q_values[2] + TRANSITION_FAIL * s.q_values[1] + TRANSITION_FAIL * s.q_values[3])
    else:
        s.q_values[3] = ACTION_REWARD + GAMMA * (s.state_value + ACTION_NORTH)
        s.state_value = ACTION_REWARD + GAMMA * (TRANSITION_SUCCEED * s.q_values[3] + TRANSITION_FAIL * s.q_values[2] + TRANSITION_FAIL * s.q_values[0])

def valueIteration():
    print('Value Iteration.')
    # Called in a loop.
    # Use computeQValue and update the state value of each cell.
    # Ideally the policy should be obtained in less than 100 iterations if possible.
    # Use cur_convergence and CONVERGENCE against states.state_value.

def policyEvaluation():
    print('Policy Evaluation')
    # Update the state values of each cell based on the current policy.
    # for i in range(policy):
    #     states.state_value = ACTION_REWARD + GAMMA * old_state_value[i + states.policy]

def policyImprovement():
    print('Policy Improvement.')
    # Responsible for updating the policy of each cell.
    # Getting the max q value: states.policy += 1 ? Or the least q value: states.policy = -1 ?
    # Policy iteration should need fewer iterations than value iteration.
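For comparison, below is a minimal sketch of how computeQValue and valueIteration are often structured for a slip-model grid world like this one. Everything about the grid is an assumption on my part: the GRID_WIDTH/GRID_HEIGHT sizes, the next_cell helper, the convention that the agent stays in place when a move would leave the grid, and the absence of terminal or goal cells are not stated in the question. It reuses the constants and Cell class above.

# Minimal sketch under the assumptions stated above, not the assignment's reference solution.
GRID_WIDTH, GRID_HEIGHT = 4, 3                 # hypothetical grid size
grid = [[Cell(x, y) for y in range(GRID_HEIGHT)] for x in range(GRID_WIDTH)]

# Direction vectors for E, S, W, N (assumed layout: x grows east, y grows south).
MOVES = {ACTION_EAST: (1, 0), ACTION_SOUTH: (0, 1),
         ACTION_WEST: (-1, 0), ACTION_NORTH: (0, -1)}

def next_cell(cell, action):
    # Hypothetical helper: the cell reached by one step in the given direction,
    # staying in place when the move would leave the grid.
    dx, dy = MOVES[action]
    nx, ny = cell.location[0] + dx, cell.location[1] + dy
    if 0 <= nx < GRID_WIDTH and 0 <= ny < GRID_HEIGHT:
        return grid[nx][ny]
    return cell

def computeQValue(s, action):
    # Q(s, a) = ACTION_REWARD + GAMMA * sum over s' of P(s' | s, a) * V(s').
    # The intended neighbor is reached with probability TRANSITION_SUCCEED;
    # each perpendicular neighbor gets TRANSITION_FAIL / 2.
    side_a = (action + 1) % 4
    side_b = (action - 1) % 4
    expected_value = (TRANSITION_SUCCEED * next_cell(s, action).state_value
                      + TRANSITION_FAIL / 2 * next_cell(s, side_a).state_value
                      + TRANSITION_FAIL / 2 * next_cell(s, side_b).state_value)
    return ACTION_REWARD + GAMMA * expected_value

def valueIteration():
    # Sweep all cells, recompute every Q value from the neighbors' current state
    # values, take V(s) = max_a Q(s, a), and stop once the largest change in any
    # state value falls below CONVERGENCE.
    change = 100
    iterations = 0
    while change > CONVERGENCE:
        change = 0.0
        for row in grid:
            for cell in row:
                old_value = cell.state_value
                cell.q_values = [computeQValue(cell, a) for a in range(4)]
                cell.state_value = max(cell.q_values)
                cell.policy = cell.q_values.index(cell.state_value)
                change = max(change, abs(cell.state_value - old_value))
        iterations += 1
    return iterations

The design choice in this sketch is that a state's Q value is built from the state values of the neighboring cells that the slip model can land in, and the state value is the maximum of the four Q values rather than another weighted combination.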
