Question: Given a simplified GRU with input x, output y, and t indexing time. The forget gate at time t is calculated as:

f_t = u x_t + v h_{t-1}

This gate corresponds to the update and reset gates in the fully gated version. The hidden state at time t is:

h_t = a x_t + b f_t h_{t-1}

And the output is:

y_t = w h_t

The learnable parameters are u, v, a, b, and w. Train this GRU with the squared error loss function:

l_t = \frac{1}{2} (\hat{y}_t - y_t)^2
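
The definitions above are easy to pin down in code. A minimal sketch of one timestep in plain Python (the function names are mine, not part of the original problem):

```python
def gru_step(x, h_prev, u, v, a, b, w):
    """One timestep of the simplified GRU from the problem statement."""
    f = u * x + v * h_prev       # forget gate: f_t = u x_t + v h_{t-1}
    h = a * x + b * f * h_prev   # hidden state: h_t = a x_t + b f_t h_{t-1}
    y = w * h                    # output: y_t = w h_t
    return f, h, y

def loss(y_hat, y):
    """Squared error: l_t = (1/2)(y_hat_t - y_t)^2."""
    return 0.5 * (y_hat - y) ** 2
```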

Q1

Given the data point (x_1 = 1, y_1 = 2), and assuming that h_0 = 1 and u = v = a = b = w = 1, what is the gradient of the loss function l_1, evaluated at these assumed values, with respect to u? Please round your answer to one decimal place.
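
By the chain rule, ∂l_1/∂u = (∂l_1/∂y_1)(∂y_1/∂h_1)(∂h_1/∂f_1)(∂f_1/∂u), and a finite-difference probe is a quick way to check the hand-derived value. A sketch, assuming \hat{y}_1 denotes the target 2 and y_1 = w h_1 the network output (the opposite reading gives the same gradient here, since the error is squared):

```python
def l1(u, x1=1.0, y_target=2.0, h0=1.0, v=1.0, a=1.0, b=1.0, w=1.0):
    """Loss l_1 as a function of u, all other values fixed as in Q1."""
    f1 = u * x1 + v * h0          # f_1 = u x_1 + v h_0
    h1 = a * x1 + b * f1 * h0     # h_1 = a x_1 + b f_1 h_0
    y1 = w * h1                   # network output y_1 = w h_1
    return 0.5 * (y_target - y1) ** 2

eps = 1e-6
grad_u = (l1(1.0 + eps) - l1(1.0 - eps)) / (2 * eps)  # central difference
print(round(grad_u, 1))   # numerical estimate of dl_1/du at u = 1
```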

Q2

Assume that we want the network to "forget" everything in the past states. How should we set the parameters (select all that apply)?

u = 0 only
v = 0 only
u = v = 0 only
a = 0 only
b = 0 only
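
"Forgetting" the past means h_t must not depend on h_{t-1}, and in this model that dependence enters only through the term b f_t h_{t-1}. A numeric probe of each option (a check I added, not part of the original question): vary h_{t-1} and see whether h_t changes.

```python
def h_t(x, h_prev, u, v, a, b):
    """Hidden state h_t = a x_t + b f_t h_{t-1}, with f_t = u x_t + v h_{t-1}."""
    f = u * x + v * h_prev
    return a * x + b * f * h_prev

base = dict(u=1.0, v=1.0, a=1.0, b=1.0)
options = [
    ("u=0 only",  {**base, "u": 0.0}),
    ("v=0 only",  {**base, "v": 0.0}),
    ("u=v=0",     {**base, "u": 0.0, "v": 0.0}),
    ("a=0 only",  {**base, "a": 0.0}),
    ("b=0 only",  {**base, "b": 0.0}),
]
for name, p in options:
    # If h_t is unchanged when h_{t-1} varies, the past has been forgotten.
    depends = h_t(1.0, 0.0, **p) != h_t(1.0, 5.0, **p)
    print(f"{name}: h_t depends on h_(t-1)? {depends}")
```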

Q3

Let's train this simplified GRU network on a large dataset where each data point has T = 1000 timesteps. How many matrix multiplications does the gradient go through from l_T to x_0? Consider the most efficient implementation.

\log T
T
T^2
\sqrt{T}
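
In backpropagation through time, the gradient flowing from l_T back to x_0 is multiplied by one recurrent Jacobian ∂h_t/∂h_{t-1} per timestep, and because each product depends on the previous one, the chain cannot be shortened by reordering. A toy sketch (the state width d and the 0.99 Jacobians are arbitrary stand-ins, not from the problem):

```python
import numpy as np

T, d = 1000, 4                              # T timesteps; d is a toy state width
J = [0.99 * np.eye(d) for _ in range(T)]    # stand-ins for dh_t/dh_{t-1}

grad = np.ones(d)            # gradient at the end of the sequence, dl_T/dh_T
count = 0
for t in reversed(range(T)): # one matrix multiplication per timestep
    grad = J[t].T @ grad
    count += 1
print(count)                 # number of multiplications in the chain
```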
