
Problem 3 REINFORCE: MC Policy-Gradient Control (4pt)

Suppose that we use the softmax policy function parameterized by $\theta$:

$$\pi_\theta(s, a) = \frac{e^{\phi(s, a)^\top \theta}}{\sum_{k=1}^{K} e^{\phi(s, a_k)^\top \theta}},$$

where $\phi(s, a)$ is a feature vector of a state-action pair $(s, a)$. Initially, $\theta$ is set to $[0.6, 0.4]^\top$. Now, you are given the feature vectors of state-action pairs as follows:

$$\phi(S_1, A_1) = (1, 1), \quad \phi(S_1, A_2) = (1, 2), \quad \phi(S_2, A_1) = (2, 3), \quad \phi(S_2, A_2) = (2, 1).$$

When you experience the episode $S_2, A_2, 5, S_1, A_2, 20, S_3$ with a step size $\alpha$ of 0.02 and a discount factor $\gamma$ of 0.5, how is the policy parameter $\theta$ updated? Answer $\theta$ and $\pi_\theta(S_2, A_2)$ after every update. Note that $\nabla_\theta \ln \pi_\theta(s, a)$ of the softmax policy is:

$$\nabla_\theta \ln \pi_\theta(s, a) = \phi(s, a) - \mathbb{E}_{\pi_\theta}[\phi(s, \cdot)] = \phi(s, a) - \sum_{a'} \pi_\theta(a' \mid s)\, \phi(s, a').$$

Step by Step Solution
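A sketch of how the updates go, assuming the plain per-step REINFORCE rule $\theta \leftarrow \theta + \alpha\, G_t\, \nabla_\theta \ln \pi_\theta(S_t, A_t)$; some courses use the Sutton and Barto form with an extra $\gamma^t$ factor in the step, which leaves the first update unchanged but halves the second one here.

The episode has two decision steps, $(S_2, A_2)$ followed by reward 5 and $(S_1, A_2)$ followed by reward 20, so with $\gamma = 0.5$ the returns are

$$G_1 = 20, \qquad G_0 = 5 + 0.5 \cdot 20 = 15.$$

For the first update, the action preferences at $S_2$ under $\theta_0 = [0.6, 0.4]^\top$ are $\phi(S_2, A_1)^\top \theta_0 = 2.4$ and $\phi(S_2, A_2)^\top \theta_0 = 1.6$, so $\pi_\theta(A_1 \mid S_2) \approx 0.690$ and $\pi_\theta(A_2 \mid S_2) \approx 0.310$. Plugging into the given gradient:

$$\nabla_\theta \ln \pi_\theta(S_2, A_2) = (2, 1) - \big[0.690 \cdot (2, 3) + 0.310 \cdot (2, 1)\big] \approx (0, -1.380),$$

$$\theta \leftarrow [0.6, 0.4]^\top + 0.02 \cdot 15 \cdot (0, -1.380) \approx [0.6, -0.014]^\top.$$

The second update repeats the same computation at $(S_1, A_2)$ with $G_1 = 20$ and the new $\theta$; the code sketch below carries out both steps numerically.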

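A minimal Python sketch of the full computation, under the same assumption about the update rule. The feature table and helper names (`policy`, `grad_log_pi`) are illustrative, not part of the problem:

```python
import numpy as np

# Feature vectors phi(s, a) from the problem statement.
phi = {
    ("S1", "A1"): np.array([1.0, 1.0]),
    ("S1", "A2"): np.array([1.0, 2.0]),
    ("S2", "A1"): np.array([2.0, 3.0]),
    ("S2", "A2"): np.array([2.0, 1.0]),
}
actions = ["A1", "A2"]

def policy(state, theta):
    """Softmax policy pi_theta(. | state) over the two actions."""
    prefs = np.array([phi[(state, a)] @ theta for a in actions])
    e = np.exp(prefs - prefs.max())  # subtract max for numerical stability
    return e / e.sum()

def grad_log_pi(state, action, theta):
    """grad_theta ln pi = phi(s, a) - sum_a' pi(a'|s) * phi(s, a')."""
    probs = policy(state, theta)
    expected_phi = sum(p * phi[(state, a)] for p, a in zip(probs, actions))
    return phi[(state, action)] - expected_phi

# Episode as (S_t, A_t, R_{t+1}) triples; S3 is terminal.
episode = [("S2", "A2", 5.0), ("S1", "A2", 20.0)]
alpha, gamma = 0.02, 0.5

# Returns G_t = R_{t+1} + gamma * G_{t+1}: G_1 = 20, G_0 = 5 + 0.5 * 20 = 15.
G, returns = 0.0, []
for _, _, r in reversed(episode):
    G = r + gamma * G
    returns.insert(0, G)

theta = np.array([0.6, 0.4])
for t, ((s, a, _), G_t) in enumerate(zip(episode, returns)):
    # Plain REINFORCE step; multiply by gamma**t here as well if your
    # course uses the discounted (Sutton & Barto) form of the update.
    theta = theta + alpha * G_t * grad_log_pi(s, a, theta)
    print(f"t={t}: theta = {theta.round(3)}, "
          f"pi(A2|S2) = {policy('S2', theta)[1]:.3f}")
```

Under these assumptions it prints $\theta \approx [0.6, -0.014]$ with $\pi_\theta(A_2 \mid S_2) \approx 0.507$ after the first update, and $\theta \approx [0.6, 0.187]$ with $\pi_\theta(A_2 \mid S_2) \approx 0.407$ after the second; with the $\gamma^t$ variant the second step is halved, giving $\theta \approx [0.6, 0.087]$.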
