Question:

Consider the fully recurrent network architecture (without output activation and bias units) defined as

$$s(t) = W x(t) + R\, a(t-1)$$
$$a(t) = f(s(t))$$
$$\hat{y}(t) = V a(t)$$

with input vectors $x(t)$, hidden pre-activation vectors $s(t)$, hidden activation vectors $a(t)$, activation function $f(\cdot)$, and parameter matrices $R$, $W$, $V$. Let $L(t) = L(y(t), \hat{y}(t))$ denote the loss function at time $t$ and let $L = \sum_{t=1}^{T} L(t)$ denote the total loss. We use the denominator-layout convention, i.e., $\delta(t) = \frac{\partial L}{\partial s(t)}$ is a column vector. Which of the following statements are true?
a. The asymptotic complexity of BPTT is $O(T^2)$.
b. The gradient of the loss with respect to the input weights $W$ can be written as $\frac{\partial L}{\partial W} = \sum_{t=1}^{T} \delta(t)\, x^{T}(t)$.
c. BPTT is a common regularization technique for recurrent neural networks.
d. The gradient of the loss with respect to the recurrent weights $R$ can be written as $\frac{\partial L}{\partial R} = \sum_{t=1}^{T} \delta(t)\, a^{T}(t-1)$.
e. The deltas fulfill the recursive relation $\delta(t) = \operatorname{diag}(f'(s(t)))\left(V^{T} \frac{\partial L(t)}{\partial \hat{y}(t)} + R^{T} \delta(t-1)\right)$.
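
To make the quantities in the question concrete, here is a minimal NumPy sketch of BPTT for this architecture. It assumes $f = \tanh$ and a squared-error loss $L(t) = \frac{1}{2}\lVert \hat{y}(t) - y(t) \rVert^2$ (the question leaves both generic), and it uses arbitrary small dimensions. The backward loop implements the standard BPTT recursion, in which the recurrent contribution propagates $R^{T}\delta(t+1)$ back from the following time step, and it accumulates the outer-product sums $\sum_t \delta(t)\,x^{T}(t)$ and $\sum_t \delta(t)\,a^{T}(t-1)$ so they can be compared against a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions (arbitrary small values for this sketch).
n_x, n_h, n_y, T = 3, 4, 2, 5

# Parameter matrices as in the question: no biases, no output activation.
W = rng.normal(size=(n_h, n_x))   # input weights
R = rng.normal(size=(n_h, n_h))   # recurrent weights
V = rng.normal(size=(n_y, n_h))   # output weights

x = rng.normal(size=(T + 1, n_x))  # inputs x(1..T); index 0 unused
y = rng.normal(size=(T + 1, n_y))  # targets y(1..T); index 0 unused

f = np.tanh                              # assumption: f = tanh
df = lambda s: 1.0 - np.tanh(s) ** 2     # its derivative f'(s)

def forward(W, R, V):
    """Forward pass: s(t) = W x(t) + R a(t-1), a(t) = f(s(t)), yhat(t) = V a(t)."""
    s = np.zeros((T + 1, n_h))
    a = np.zeros((T + 1, n_h))       # a(0) = 0 by convention
    y_hat = np.zeros((T + 1, n_y))
    loss = 0.0
    for t in range(1, T + 1):
        s[t] = W @ x[t] + R @ a[t - 1]
        a[t] = f(s[t])
        y_hat[t] = V @ a[t]
        loss += 0.5 * np.sum((y_hat[t] - y[t]) ** 2)  # assumption: squared error
    return s, a, y_hat, loss

s, a, y_hat, loss = forward(W, R, V)

# Backward pass (BPTT): delta(t) = dL/ds(t), computed from t = T down to 1.
delta = np.zeros((T + 1, n_h))
dW, dR = np.zeros_like(W), np.zeros_like(R)
delta_future = np.zeros(n_h)             # delta(t+1); zero beyond t = T
for t in range(T, 0, -1):
    dL_dyhat = y_hat[t] - y[t]           # dL(t)/dyhat(t) for squared error
    # diag(f'(s(t))) v is an elementwise product f'(s(t)) * v.
    delta[t] = df(s[t]) * (V.T @ dL_dyhat + R.T @ delta_future)
    delta_future = delta[t]
    dW += np.outer(delta[t], x[t])       # accumulates delta(t) x^T(t)
    dR += np.outer(delta[t], a[t - 1])   # accumulates delta(t) a^T(t-1)

# Finite-difference check on one entry of W.
eps = 1e-6
W_pert = W.copy()
W_pert[0, 0] += eps
num = (forward(W_pert, R, V)[3] - loss) / eps
print(dW[0, 0], num)  # the two values should agree to several digits
```

Note that the backward loop performs a constant amount of work per time step for fixed layer sizes, so its runtime grows linearly in $T$, and the finite-difference print lets you verify the summed outer-product form of $\partial L / \partial W$ numerically; the same check can be repeated for entries of $R$.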