Question:
We consider the use of a single-hidden-layer neural network for representing a stochastic policy or a value function in RL. The inputs to the neural network are features of the state. The weights of the neural network are the parameters to be updated. There is one output corresponding to each action in the control setting, and a single output in the prediction setting. For reference, a pictorial representation of the neural network is shown below, and the notation is explained thereafter.

[Figure: an input layer (nodes i, with inputs x(s, i)), fully connected through weights w_ij to a hidden layer (nodes j), which is fully connected to an output layer (nodes k, with outputs outK(s, k)).]

• Let there be \(I\), \(J\), and \(K\) nodes in the input, hidden, and output layers, respectively. For convenience, define \([I] = \{0, 1, \dots, I-1\}\), \([J] = \{0, 1, \dots, J-1\}\), and \([K] = \{0, 1, \dots, K-1\}\). In the figure, \(I = 3\), \(J = 4\), \(K = 3\).
• For state \(s\) and \(i \in [I]\), let \(x(s, i)\) denote the \(i\)-th feature of the state. The input layer merely passes on its input to its output. Hence, for state \(s\) and \(i \in [I]\), we have \(\mathrm{outI}(s, i) = x(s, i)\).
• Every \(j\)-th node in the hidden layer linearly combines the outputs of the input layer, and passes the weighted sum through the sigmoid function \(\sigma(\beta) = \frac{1}{1 + e^{-\beta}}\) for \(\beta \in \mathbb{R}\). Thus, for state \(s\) and \(j \in [J]\), \(\mathrm{outJ}(s, j) = \sigma\left(\sum_{i \in [I]} w_{ij}\, \mathrm{outI}(s, i)\right)\). Observe that there is a weight \(w_{ij}\) connecting each input node \(i \in [I]\) with each hidden node \(j \in [J]\).
• Every \(k\)-th node in the output layer corresponds to an action (hence take the set of actions as \([K]\)). Node \(k \in [K]\) linearly combines the outputs of the hidden layer. Thus, for state \(s\) and \(k \in [K]\), \(\mathrm{outK}(s, k) = \sum_{j \in [J]} w_{jk}\, \mathrm{outJ}(s, j)\). Again, there is a weight \(w_{jk}\) connecting each hidden node \(j \in [J]\) with each output node \(k \in [K]\). For uniformity of notation, think of the prediction task as having a single action (implemented by taking \(K = 1\)).

The parameters of the representation are the weights \(w_{ij}\) for \(i \in [I]\), \(j \in [J]\) and the weights \(w_{jk}\) for \(j \in [J]\), \(k \in [K]\); in the pseudocode you are asked to provide for this question (see below), store these parameters in 2-d arrays wIJ[][] and wJK[][], respectively. You can also assume that the functions x(,), outI(,), outJ(,), outK(,), and σ(·) are already implemented.

6a. A stochastic policy is implemented using the "soft-max" operator on the outputs of the neural network. Thus, for state \(s\), the probability of selecting action \(k \in [K]\) is
\[ \frac{e^{\mathrm{outK}(s, k)}}{\sum_{k' \in [K]} e^{\mathrm{outK}(s, k')}}. \]
Suppose the current policy \(\pi\) is parameterised by weights wIJ[][] and wJK[][]. Say an episode s[0], a[0], r[0], s[1], a[1], r[1], s[2], ..., s[T] is generated by following \(\pi\), where s[T] is a terminal state. Write down pseudocode to perform a REINFORCE update with step size \(\alpha\). You can assume wIJ[][], wJK[][], T, s[], a[], r[], and \(\alpha\) are already populated. Your code must terminate with wIJ[][] and wJK[][] updated as per the REINFORCE rule at the end of the episode. Since you will need to compute a gradient for making the update, show the steps to work out the actual form of the gradient before presenting the pseudocode for the update. [6 marks]

6b. Suppose the same neural network is used to approximate a value function for a prediction task. In this case we have a single output (that is, \(K = 1\)). For each state \(s\), \(\mathrm{outK}(s, 0)\) is interpreted as \(V(s)\). The aim is to drive \(V\) towards \(V^{\pi}\) by making TD(0) updates, where \(\pi\) is the policy being followed. Suppose the current approximation of \(V\) uses weights wIJ[][] and wJK[][]. Say the transition \(s, r, s'\) is observed. Write down pseudocode for the TD(0) update performed upon reaching \(s'\), with learning rate \(\alpha\) and discount factor \(\gamma\). Assume wIJ[][], wJK[][], \(s\), \(r\), \(s'\), \(\alpha\), and \(\gamma\) are already populated. Your code must terminate with wIJ[][] and wJK[][] updated correctly. Here, too, show the steps to obtain the form of the gradient before using it in your pseudocode as a part of the learning update. Since \(K = 1\), you can use wJK[] instead of wJK[][] if you would like, but the 2-dimensional variant is also okay, keeping the second index 0. [3 marks]
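Both parts ask for the gradient of a network output with respect to the weights. The block below is one possible way to work these gradients out, not an official solution. The shorthand \(h_j\), \(o_k\), \(x_i\), \(\pi(s,k)\), and \(\delta\) is introduced here for brevity and does not appear in the question. The derivation relies on the standard identity \(\sigma'(\beta) = \sigma(\beta)(1 - \sigma(\beta))\).

```latex
% Shorthand (an assumption of this sketch):
%   x_i = x(s,i),  h_j = outJ(s,j),  o_k = outK(s,k),
%   \pi(s,k) = e^{o_k} / \sum_{k'} e^{o_{k'}}
\begin{align*}
% Derivatives of the raw outputs
\frac{\partial o_k}{\partial w_{jk'}} &= \mathbb{1}[k = k']\, h_j,
&
\frac{\partial o_k}{\partial w_{ij}} &= w_{jk}\, h_j (1 - h_j)\, x_i. \\[4pt]
% 6a: log-probability of the chosen action a under soft-max
\frac{\partial \ln \pi(s,a)}{\partial o_k} &= \mathbb{1}[k = a] - \pi(s,k), \\
\frac{\partial \ln \pi(s,a)}{\partial w_{jk}} &= \bigl(\mathbb{1}[k = a] - \pi(s,k)\bigr)\, h_j, \\
\frac{\partial \ln \pi(s,a)}{\partial w_{ij}} &= x_i\, h_j (1 - h_j) \sum_{k \in [K]} \bigl(\mathbb{1}[k = a] - \pi(s,k)\bigr)\, w_{jk}. \\[4pt]
% 6b: value estimate V(s) = o_0 with K = 1
\frac{\partial V(s)}{\partial w_{j0}} &= h_j,
&
\frac{\partial V(s)}{\partial w_{ij}} &= w_{j0}\, h_j (1 - h_j)\, x_i, \\
\delta &= r + \gamma V(s') - V(s),
&
w &\leftarrow w + \alpha\, \delta\, \frac{\partial V(s)}{\partial w}.
\end{align*}
```

In 6b the target \(r + \gamma V(s')\) is treated as a constant when differentiating, which is the usual semi-gradient form of TD(0).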
Expert Answer:
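No answer text survives on this page. The following Python sketch illustrates how the two requested updates could look; it is not the original expert answer. It makes several assumptions that the question does not state: features are passed in directly as lists (so `xs[t][i]` stands in for x(s[t], i)), returns in REINFORCE are undiscounted, the REINFORCE gradients are accumulated with the weights held fixed and applied once at the end of the episode, and s' in the TD(0) step is non-terminal. The helper names `forward`, `softmax`, `backprop`, and `add_scaled` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, wIJ, wJK):
    """Return (outJ, outK) as lists for the feature list x, where x[i] = x(s, i)."""
    J, K = len(wJK), len(wJK[0])
    h = [sigmoid(sum(wIJ[i][j] * x[i] for i in range(len(x)))) for j in range(J)]
    o = [sum(wJK[j][k] * h[j] for j in range(J)) for k in range(K)]
    return h, o

def softmax(o):
    m = max(o)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in o]
    total = sum(e)
    return [v / total for v in e]

def backprop(x, wIJ, wJK, d_out):
    """Gradients of sum_k d_out[k] * outK(s, k) with respect to wIJ and wJK."""
    h, _ = forward(x, wIJ, wJK)
    J, K = len(wJK), len(wJK[0])
    gJK = [[d_out[k] * h[j] for k in range(K)] for j in range(J)]
    # Signal reaching hidden node j, times the sigmoid derivative h_j (1 - h_j).
    back = [sum(d_out[k] * wJK[j][k] for k in range(K)) * h[j] * (1.0 - h[j])
            for j in range(J)]
    gIJ = [[back[j] * x[i] for j in range(J)] for i in range(len(x))]
    return gIJ, gJK

def add_scaled(w, g, scale):
    """In place: w += scale * g for 2-d lists of equal shape."""
    for row_w, row_g in zip(w, g):
        for c in range(len(row_w)):
            row_w[c] += scale * row_g[c]

def reinforce_update(xs, a, r, wIJ, wJK, alpha):
    """6a sketch: w += alpha * sum_t G_t * grad ln pi(a[t] | s[t]), undiscounted G_t."""
    T, K = len(a), len(wJK[0])
    dIJ = [[0.0] * len(row) for row in wIJ]
    dJK = [[0.0] * len(row) for row in wJK]
    G = 0.0
    for t in reversed(range(T)):
        G += r[t]  # return from step t onwards
        _, o = forward(xs[t], wIJ, wJK)
        pi = softmax(o)
        # d ln pi(a[t]) / d outK(s[t], k) = 1[k == a[t]] - pi[k]
        d_out = [(1.0 if k == a[t] else 0.0) - pi[k] for k in range(K)]
        gIJ, gJK = backprop(xs[t], wIJ, wJK, d_out)
        add_scaled(dIJ, gIJ, G)
        add_scaled(dJK, gJK, G)
    add_scaled(wIJ, dIJ, alpha)
    add_scaled(wJK, dJK, alpha)

def td0_update(x_s, r, x_next, wIJ, wJK, alpha, gamma):
    """6b sketch: semi-gradient TD(0) with K = 1; returns the TD error."""
    _, o_s = forward(x_s, wIJ, wJK)
    _, o_next = forward(x_next, wIJ, wJK)
    delta = r + gamma * o_next[0] - o_s[0]
    gIJ, gJK = backprop(x_s, wIJ, wJK, [1.0])  # gradient of V(s) itself
    add_scaled(wIJ, gIJ, alpha * delta)
    add_scaled(wJK, gJK, alpha * delta)
    return delta
```

Both update functions modify wIJ and wJK in place, matching the requirement that the code terminate with the arrays updated. If s' can be terminal, the bootstrap term `gamma * o_next[0]` would typically be replaced by 0 for that transition.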