Question: I need help figuring out if my thought process is right for this MDP (Markov Decision Process). These are the constants:
ACTION_EAST = 0
ACTION_SOUTH = 1
ACTION_WEST = 2
ACTION_NORTH = 3

TRANSITION_SUCCEED = 0.8  # probability that taking action A moves to the intended destination state S'
TRANSITION_FAIL = 0.2     # probability that taking action A moves to an unintended state S'; e.g. taking East
                          # may move North or South instead, 0.1 each (we assume the two perpendicular
                          # directions evenly split TRANSITION_FAIL = 0.2)
GAMMA = 0.9               # the discount factor
ACTION_REWARD = -0.1      # the instantaneous reward for taking each action (we assume the four
                          # actions N/E/W/S have the same reward)
CONVERGENCE = 0.0000001   # the threshold for convergence, used to decide when to stop
cur_convergence = 100
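For reference, a worked example of the one-step backup these constants imply (my own illustration, not text from the assignment): for taking East in a state s,

    Q(s, East) = ACTION_REWARD + GAMMA * (0.8 * V(east neighbour) + 0.1 * V(north neighbour) + 0.1 * V(south neighbour))
               = -0.1 + 0.9 * (0.8 * V_east + 0.1 * V_north + 0.1 * V_south)

where V(.) is the current state value of the neighbouring cell; what happens when a move would leave the grid (commonly, the agent stays in place) is not stated above and is an assumption here.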
Also, here is the Cell class that is implemented:
class Cell:
    def __init__(self, x, y):
        self.q_values = [0.0, 0.0, 0.0, 0.0]   # one Q-value per action (E, S, W, N)
        self.location = (x, y)
        self.state_value = max(self.q_values)  # V(s) = max over the Q-values
        self.policy = 0                        # index of the action currently considered best
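One detail worth flagging about this class (an observation added here, not part of the original post): state_value is computed once in __init__ from the initial q_values and is not recomputed automatically when q_values changes later, so it has to be refreshed explicitly. A minimal illustration:

    c = Cell(0, 0)
    c.q_values[0] = 1.0                         # change a Q-value...
    print(c.state_value)                        # ...still 0.0: state_value was set only in __init__
    c.state_value = max(c.q_values)             # refresh V(s) = max_a Q(s, a)
    c.policy = c.q_values.index(c.state_value)  # greedy action index (0 = ACTION_EAST here)
    print(c.state_value, c.policy)              # 1.0 0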
#####Implement the below functions ############################

def computeQValue(s, action):
    print('Compute Q Values')
    # s is the state of each cell
    # action takes a value 0-3: 0-east, 1-south, 2-west, 3-north
    # For each cell, the Q-value is calculated based on the action taken
    # and the state data is updated with the Q-value
    # Do I need this: state_old_value = s.state_value.copy() ?
    if action == ACTION_EAST:
        s.q_values[0] = ACTION_REWARD + GAMMA*(s.state_value + ACTION_EAST)
        s.state_value = ACTION_REWARD + GAMMA*(TRANSITION_SUCCEED*s.q_values[0] + TRANSITION_FAIL*s.q_values[1] + TRANSITION_FAIL*s.q_values[3])
    elif action == ACTION_SOUTH:
        s.q_values[1] = ACTION_REWARD + GAMMA*(s.state_value + ACTION_SOUTH)
        s.state_value = ACTION_REWARD + GAMMA*(TRANSITION_SUCCEED*s.q_values[1] + TRANSITION_FAIL*s.q_values[0] + TRANSITION_FAIL*s.q_values[2])
    elif action == ACTION_WEST:
        s.q_values[2] = ACTION_REWARD + GAMMA*(s.state_value + ACTION_WEST)
        s.state_value = ACTION_REWARD + GAMMA*(TRANSITION_SUCCEED*s.q_values[2] + TRANSITION_FAIL*s.q_values[1] + TRANSITION_FAIL*s.q_values[3])
    else:
        s.q_values[3] = ACTION_REWARD + GAMMA*(s.state_value + ACTION_NORTH)
        s.state_value = ACTION_REWARD + GAMMA*(TRANSITION_SUCCEED*s.q_values[3] + TRANSITION_FAIL*s.q_values[2] + TRANSITION_FAIL*s.q_values[0])

def valueIteration():
    print('Value Iteration.')
    # called in a loop
    # use computeQValue and update the state value of each cell
    # ideally the policy should be obtained in less than 100 iterations if possible
    # use cur_convergence and CONVERGENCE together with states.state_value

def policyEvaluation():
    print('Policy Evaluation')
    # updating the state values of each cell based on the current policy
    # for i in range(policy):
    #     states.state_value = ACTION_REWARD + GAMMA*old_state_value[i + states.policy]

def policyImprovement():
    print('Policy Improvement.')
    # responsible for updating the policy of each cell
    # getting the max q value: states.policy = +1 ?
    # the least q value: states.policy = -1 ?
    # policy iteration should need fewer iterations than value iteration
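For comparison, below is a minimal sketch of how computeQValue and valueIteration are often structured for a grid MDP with this transition model. It is only a sketch under stated assumptions, not the assignment's reference solution: the grid variable, its dimensions (GRID_WIDTH, GRID_HEIGHT), the neighbor_value helper, the coordinate convention, and the "stay in place at the border" rule are all introduced here for illustration. The key point it shows is that Q(s, a) is an expectation over the current values of the neighbouring states reachable from s, rather than a function of s.state_value plus the action index.

    # Assumption: the world is a GRID_WIDTH x GRID_HEIGHT grid of Cell objects,
    # with x growing East and y growing North.
    GRID_WIDTH, GRID_HEIGHT = 4, 3
    grid = [[Cell(x, y) for y in range(GRID_HEIGHT)] for x in range(GRID_WIDTH)]

    # Displacement per action, and the two perpendicular "slip" directions per action.
    MOVES = {ACTION_EAST: (1, 0), ACTION_SOUTH: (0, -1),
             ACTION_WEST: (-1, 0), ACTION_NORTH: (0, 1)}
    SLIPS = {ACTION_EAST: (ACTION_NORTH, ACTION_SOUTH),
             ACTION_WEST: (ACTION_NORTH, ACTION_SOUTH),
             ACTION_NORTH: (ACTION_EAST, ACTION_WEST),
             ACTION_SOUTH: (ACTION_EAST, ACTION_WEST)}

    def neighbor_value(s, action):
        # Current value of the cell reached from s by 'action'; stay in place at the border.
        x, y = s.location
        dx, dy = MOVES[action]
        nx, ny = x + dx, y + dy
        if 0 <= nx < GRID_WIDTH and 0 <= ny < GRID_HEIGHT:
            return grid[nx][ny].state_value
        return s.state_value

    def computeQValue(s, action):
        # Q(s, a) = R + GAMMA * sum over s' of P(s' | s, a) * V(s')
        slip_a, slip_b = SLIPS[action]
        expected = (TRANSITION_SUCCEED * neighbor_value(s, action)
                    + TRANSITION_FAIL / 2 * neighbor_value(s, slip_a)
                    + TRANSITION_FAIL / 2 * neighbor_value(s, slip_b))
        s.q_values[action] = ACTION_REWARD + GAMMA * expected

    def valueIteration():
        # Sweep all cells until the largest change in any state value drops below CONVERGENCE.
        cur_convergence = float('inf')
        while cur_convergence > CONVERGENCE:
            cur_convergence = 0.0
            for column in grid:
                for s in column:
                    old_value = s.state_value          # kept only to measure the change
                    for a in range(4):
                        computeQValue(s, a)
                    s.state_value = max(s.q_values)    # V(s) = max_a Q(s, a)
                    s.policy = s.q_values.index(s.state_value)
                    cur_convergence = max(cur_convergence, abs(s.state_value - old_value))

Under the same assumptions, policy evaluation would repeatedly back up only the Q-value of each cell's current policy action, and policy improvement would set policy to the argmax over the freshly computed Q-values; the alternation typically needs fewer sweeps than value iteration.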
