Question: Answer Part 2 Plz: REINFORCE: Monte - Carlo Policy - Gradient Control ( episodic ) for * Input: a differentiable policy parameterization ( a |

REINFORCE: Monte-Carlo Policy-Gradient Control (episodic) for $\pi_*$

Input: a differentiable policy parameterization $\pi(a \mid s, \theta)$

Algorithm parameter: step size $\alpha > 0$

Initialize policy parameter $\theta \in \mathbb{R}^{d'}$ (e.g., to $0$)

Loop forever (for each episode):

    Generate an episode $S_0, A_0, R_1, \dots, S_{T-1}, A_{T-1}, R_T$, following $\pi(\cdot \mid \cdot, \theta)$

    Loop for each step of the episode $t = 0, 1, \dots, T-1$:

        $G \leftarrow \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$

        $\theta \leftarrow \theta + \alpha \gamma^{t} G \nabla \ln \pi(A_t \mid S_t, \theta)$

Assume the agent uses REINFORCE for learning a policy while navigating in a continuous 2D square maze, with center at origin. It starts at the state $(0, 0)$. The agent's policy is parameterized by a linear function where the final layer outputs the mean action $W s = \mu = (\mu_x, \mu_y)$. Here, $W \in \mathbb{R}^{2 \times 2}$ is a $2 \times 2$ matrix initialized as all zeros, and $s$ is the state.

During execution, the agent then samples an action $(a_x, a_y) \sim \mathcal{N}(\mu, I)$, a 2-dimensional Gaussian distribution with mean $\mu$ and identity variance. The first trajectory is:

$s_0 = (0, 0),\ a_0 = (0.5, -0.2),\ s_1 = (1, -0.2),\ a_1 = (0.2, 0.1),\ s_2 = (1.2, -0.1),\ \dots,\ s_5 = (3.2, 1.3)$.

The trajectory ends in $s_5$ because the agent falls into a trap and receives a negative reward of $-1$ ($R(s_4, a_4) = -1$). Otherwise, the agent receives a reward of $0$ for every previous step. Assume $\gamma = 0.9$.

What is the return $G$ at state $s_0$? Please specify to the 4th decimal place.

Feedback

Based on answering correctly: $G = \sum_{t=0}^{4} \gamma^{t} R(s_t, a_t) = \gamma^{4} \cdot (-1) = -0.6561$

0 points earned
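The arithmetic in the feedback can be checked with a few lines of Python. This is only a minimal sketch: the helper name `discounted_return` is hypothetical, and the reward list assumes the episode described above (reward 0 for the first four steps, then -1 for the step taken from $s_4$).

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t], with rewards[0] the first reward received."""
    g = 0.0
    # Accumulate backwards so each earlier step discounts everything after it once.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Assumed reward sequence R(s_0, a_0), ..., R(s_4, a_4) from the problem statement.
rewards = [0.0, 0.0, 0.0, 0.0, -1.0]
G0 = discounted_return(rewards, gamma=0.9)
print(f"{G0:.4f}")  # -0.6561
```

The backward accumulation is the same quantity as the box's $\sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$ at $t = 0$, just written with rewards indexed by the step that produced them.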

Assume learning rate $\alpha = 0.1$. What will be the sum of all the elements in the updated $W$ right after we loop over the first state $s_0$ (so we have just updated $W$ based on the return from state $s_0$ of this episode and haven't updated the parameters based on $s_1$ yet; please refer to the REINFORCE algorithm on page 328 of the textbook)?

Please write the answer to the 4th decimal place.

Hint: recall the formula of the multivariate Gaussian,

$\text{pdf} = \frac{1}{\sqrt{(2\pi)^{d} |\Sigma|}} \exp\left(-\frac{1}{2} (x - \mu)^{T} \Sigma^{-1} (x - \mu)\right)$,

where $d$ is the dimension.
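One way to work Part 2 from the hint: with $\Sigma = I$ and $\mu = W s$, the log-density is $\ln \pi(a \mid s, W) = -\frac{1}{2} \lVert a - W s \rVert^2 + \text{const}$, so $\nabla_W \ln \pi(a \mid s, W) = (a - W s)\, s^{T}$. At $s_0 = (0, 0)$ this outer product is the zero matrix, so the first update leaves $W$ at all zeros and the sum of its elements is $0.0000$. The sketch below checks this numerically in plain Python; the function names are hypothetical, and it assumes the update rule from the algorithm box with $\theta = W$.

```python
def grad_log_gaussian_mean(W, s, a):
    """Gradient of ln N(a; W s, I) with respect to W: the outer product (a - W s) s^T."""
    mu = [sum(W[i][j] * s[j] for j in range(len(s))) for i in range(len(W))]
    return [[(a[i] - mu[i]) * s[j] for j in range(len(s))] for i in range(len(W))]

def reinforce_step(W, s, a, G, t, alpha, gamma):
    """One REINFORCE update: W <- W + alpha * gamma**t * G * grad ln pi(a | s, W)."""
    grad = grad_log_gaussian_mean(W, s, a)
    scale = alpha * (gamma ** t) * G
    return [[W[i][j] + scale * grad[i][j] for j in range(len(s))] for i in range(len(W))]

# Values taken from the problem statement: W starts at zero, first step is (s_0, a_0).
W = [[0.0, 0.0], [0.0, 0.0]]
G0 = -(0.9 ** 4)  # return at s_0 from Part 1
W_new = reinforce_step(W, s=(0.0, 0.0), a=(0.5, -0.2), G=G0, t=0, alpha=0.1, gamma=0.9)
total = sum(sum(row) for row in W_new)
print(f"{total:.4f}")
```

Because the gradient is proportional to $s^{T}$, any state at the origin produces no change in $W$ regardless of the action or the return; a nonzero update first appears at $s_1 = (1, -0.2)$.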
