2 Reducing Variance in Policy Gradient Methods In class, we explored REINFORCE as a policy gradient...
Question:
2 Reducing Variance in Policy Gradient Methods

In class, we explored REINFORCE as a policy gradient method with no bias but high variance. In this problem, we will explore methods to dramatically reduce variance in policy gradient methods, potentially at the cost of increased bias.

Let us consider an infinite horizon MDP $M = (S, A, R, T, \gamma)$. Let us define

$$A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)$$

An approximation to the policy gradient is defined as

$$g = \mathbb{E}_{s_{0:\infty},\, a_{0:\infty}}\left[\sum_{t=0}^{\infty} A^{\pi}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]$$

where the colon notation $a : b$ represents the range $[a, a+1, a+2, \ldots, b]$ inclusive of both ends.

(a) [3 points (Online)] Please refer to question 2 of the Gradescope online assessment A3 (Quiz).

(b) [3 points (Written)] Prove that $\mathrm{Var}(R_{t+1}) \geq \mathrm{Var}(R_t)$ is true if we assume that $r_{t+1}$ is, on average, correlated with the previous rewards, i.e. $\sum_{i=0}^{t} \mathrm{Cov}(r_i, r_{t+1}) > 0$. You may find the following properties helpful in proving this result:

$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$$

$$\mathrm{Cov}(X + Y, Z) = \mathrm{Cov}(X, Z) + \mathrm{Cov}(Y, Z)$$

Hint: Try to write the expression for the return at a given timestep as a sum of rewards.

(c) [5 points (Written)] In practice, we do not have access to the true function $A^{\pi}(s_t, a_t)$, so we would like to obtain an estimate instead. We will consider the general form of an estimator $\hat{A}_t(s_{0:\infty}, a_{0:\infty})$ that can be a function of the entire trajectory. Consider the following setup:

$$\hat{A}_t(s_{0:\infty}, a_{0:\infty}) = Q_t(s_{t:\infty}, a_{t:\infty}) - b_t(s_{0:t}, a_{0:t-1})$$

$$\mathbb{E}_{s_{t+1:\infty},\, a_{t+1:\infty}}\left[Q_t(s_{t:\infty}, a_{t:\infty})\right] = Q^{\pi}(s_t, a_t)$$

which indicates that for all $s_t, a_t$, $Q_t$ is an unbiased estimator of the true $Q^{\pi}$. Also note that $b_t$ is an arbitrary function of the actions and states sampled before $a_t$. Prove that by using this estimate $\hat{A}_t$, we obtain an unbiased estimate of the policy gradient $g$. In other words, prove the following identity:

$$\mathbb{E}_{s_{0:\infty},\, a_{0:\infty}}\left[\sum_{t=0}^{\infty} \hat{A}_t(s_{0:\infty}, a_{0:\infty})\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] = g$$

Note: Recall the following result from class:

$$\mathbb{E}_{\tau}\left[b_t(s_{0:t}, a_{0:t-1})\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] = 0$$

You may cite this result without proof in your answer. Please consult the a.1 for further details on how we derived this result. We have also provided you with the first few lines of a proof for this question to help you get started.

$$\begin{aligned}
&\mathbb{E}_{s_{0:\infty},\, a_{0:\infty}}\left[\sum_{t=0}^{\infty} \hat{A}_t(s_{0:\infty}, a_{0:\infty})\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] \\
&\quad = \mathbb{E}_{s_{0:\infty},\, a_{0:\infty}}\left[\sum_{t=0}^{\infty} \left(Q_t(s_{t:\infty}, a_{t:\infty}) - b_t(s_{0:t}, a_{0:t-1})\right) \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] \\
&\quad = \mathbb{E}_{s_{0:\infty},\, a_{0:\infty}}\left[\sum_{t=0}^{\infty} Q_t(s_{t:\infty}, a_{t:\infty})\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] - \underbrace{\mathbb{E}_{s_{0:\infty},\, a_{0:\infty}}\left[\sum_{t=0}^{\infty} b_t(s_{0:t}, a_{0:t-1})\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]}_{0}
\end{aligned}$$

(d) [3 points (Written)] We will now look at a few different variants of $\hat{A}_t$. Recall the TD error $\delta_t(s_t, a_t) = r_t + \gamma V(s_{t+1}) - V(s_t)$. If $V = V^{\pi}$, prove that $\delta_t$ is an unbiased estimate of $A^{\pi}$. Note: Recall that an estimator $\hat{\theta}$ is an unbiased estimator of $\theta$ if $\mathbb{E}[\hat{\theta}] = \theta$.

(e) [3 points (Written)] Let us define $\hat{A}_t^{(k)} = \sum_{i=0}^{k-1} \gamma^i \delta_{t+i}$. Show that

$$\hat{A}_t^{(k)} = -V(s_t) + \gamma^k V(s_{t+k}) + \sum_{i=0}^{k-1} \gamma^i r_{t+i}$$

In general, how do bias and variance change as $k$ increases? A few sentences of justification will suffice when describing the changes in variance and bias as we increase $k$; no formal proof is necessary. Hint: if you expand the expression for $\hat{A}_t^{(k)}$ you should observe a telescoping sum, which can help simplify your proof for the first part of this question.

(f) [3 points (Written)] Show that

$$\hat{A}_t^{(\infty)} = \sum_{i=0}^{\infty} \gamma^i r_{t+i} - V(s_t)$$

Note: you may assume that $0 \leq \gamma < 1$.
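The telescoping identity in part (e) can be sanity-checked numerically before attempting the proof. The sketch below is not a proof and is not part of the assignment. It assumes a finite trajectory with randomly drawn rewards and value estimates, and the function names (td_errors, k_step_advantage_from_deltas, k_step_advantage_closed_form) are hypothetical helpers introduced only for illustration. It computes the k-step estimator once as a discounted sum of TD errors delta_{t+i} = r_{t+i} + gamma * V(s_{t+i+1}) - V(s_{t+i}), and once from the closed form -V(s_t) + gamma^k * V(s_{t+k}) + sum_i gamma^i * r_{t+i}, then compares the two.

```python
import random


def td_errors(rewards, values, gamma):
    """TD error for each step: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` must hold one more entry than `rewards` (the value of the
    state reached after the last reward).
    """
    return [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]


def k_step_advantage_from_deltas(deltas, t, k, gamma):
    """k-step estimator as a discounted sum of k consecutive TD errors."""
    return sum(gamma**i * deltas[t + i] for i in range(k))


def k_step_advantage_closed_form(rewards, values, t, k, gamma):
    """Closed form: -V(s_t) + gamma^k V(s_{t+k}) + sum_i gamma^i r_{t+i}."""
    discounted_rewards = sum(gamma**i * rewards[t + i] for i in range(k))
    return -values[t] + gamma**k * values[t + k] + discounted_rewards


# Assumed toy data: arbitrary rewards and value estimates, not from any MDP.
random.seed(0)
gamma = 0.9
T = 12
rewards = [random.uniform(-1, 1) for _ in range(T)]
values = [random.uniform(-1, 1) for _ in range(T + 1)]
deltas = td_errors(rewards, values, gamma)

# Largest disagreement between the two forms over all valid (t, k) pairs.
max_gap = max(
    abs(
        k_step_advantage_from_deltas(deltas, t, k, gamma)
        - k_step_advantage_closed_form(rewards, values, t, k, gamma)
    )
    for t in range(T)
    for k in range(1, T - t + 1)
)
print(max_gap)  # expected to be at floating-point rounding level
```

Because the values here are arbitrary numbers, the check shows that the identity is purely algebraic and does not depend on V being the true value function. For part (f), note that the gamma^k * V(s_{t+k}) term is the only one that involves V at a later state; when V is bounded and gamma < 1 it shrinks as k grows.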