Question: Please solve and explain, and check my answer. This is a homework question from a COE Topics course.

Question 4: (6 Points) Consider a 2×3 game world that has 6 states {A, B, C, D, E, F} and four actions (up, down, left, right), as shown below. Every new episode starts in a randomly chosen state and ends at the terminal state F. When state F is reached, the player receives a reward of +10 and the game ends. For all other actions, i.e., those that do not lead to state F, the reward is -1. Assume that the greedy policy is used after training. Also assume that α = 1 and γ = 0.9. Assume that the Q-learning algorithm was applied and that the following is the initial Q function Q(s, a), where s is a state and a is an action.

[Initial Q-table: rows = states A–F, columns = actions up / down / left / right; the values were given in a figure that is not reproduced here.]

A. Using the initial Q function, perform one action (B, up) and update the Q function. [2 pts]
B. Using the initial Q function, perform one episode and update the Q table, starting from state A. Note that an episode is defined as a full game from a given state until the game ends. [4 pts]
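Below is a minimal Python sketch of the two computations the question asks for. The core of both parts is the Q-learning update Q(s,a) ← Q(s,a) + α·(r + γ·max_a' Q(s',a') − Q(s,a)), which with α = 1 collapses to Q(s,a) ← r + γ·max_a' Q(s',a'). Everything the post does not specify is an assumption here: the grid layout (A, B, C in the top row and D, E, F in the bottom row), the wall behavior (bumping a wall leaves the agent in place), the behavior policy during the part-B episode (greedy on the current Q-table), and the initial Q-values (zeros as placeholders, since the question's table is missing).

```python
# Sketch of parts A and B. Items marked ASSUMED are not given in the post.

STATES = ["A", "B", "C", "D", "E", "F"]
ACTIONS = ["up", "down", "left", "right"]
ALPHA, GAMMA = 1.0, 0.9  # given: alpha = 1, gamma = 0.9

# ASSUMED layout:  A B C
#                  D E F
POS = {s: (i // 3, i % 3) for i, s in enumerate(STATES)}
MOVE = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Apply one move; off-grid moves leave the state unchanged (ASSUMED)."""
    r, c = POS[state]
    dr, dc = MOVE[action]
    nr, nc = r + dr, c + dc
    next_state = STATES[nr * 3 + nc] if 0 <= nr < 2 and 0 <= nc < 3 else state
    reward = 10 if next_state == "F" else -1  # +10 into F, -1 otherwise
    return next_state, reward

def q_update(Q, state, action):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    next_state, reward = step(state, action)
    bootstrap = 0.0 if next_state == "F" else max(Q[next_state].values())
    Q[state][action] += ALPHA * (reward + GAMMA * bootstrap - Q[state][action])
    return next_state

# Placeholder Q-table: all zeros (the real initial values are in the
# question's missing table and should be substituted here).
Q = {s: {a: 0.0 for a in ACTIONS} for s in STATES}

# Part A: one update for (B, up). Under the assumed layout, B is in the top
# row, so "up" bumps the wall: s' = B, r = -1, and with alpha = 1 the update
# collapses to Q(B, up) = -1 + 0.9 * max_a Q(B, a).
q_update(Q, "B", "up")

# Part B: one episode from A, acting greedily on the current Q-table
# (ASSUMED behavior policy; the post does not say how actions are chosen
# during the episode). The loop ends when the terminal state F is reached.
state = "A"
while state != "F":
    action = max(Q[state], key=Q[state].get)  # greedy, ties broken by order
    state = q_update(Q, state, action)
```

With the real initial Q-values substituted for the zero placeholders, part A is the single q_update(Q, "B", "up") call, and part B is the episode loop, writing down each updated Q-table entry as you go.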
