Question: Given a simplified GRU with input x, output y, and t indexing time. The forget gate at time t is calculated as:

f_t = u x_t + v h_{t-1}

This gate corresponds to the update and reset gates in the fully gated version. The hidden state at time t is:

h_t = a x_t + b f_t h_{t-1}

And the output is:

y_t = w h_t

The learnable parameters are u, v, a, b, and w. Train this GRU with the squared error loss function:

l_t = \frac{1}{2} (\hat{y}_t - y_t)^2
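
The definitions above are easy to pin down in code. A minimal sketch of one timestep in plain Python (the function names are mine, not part of the original problem):

```python
def gru_step(x, h_prev, u, v, a, b, w):
    """One timestep of the simplified GRU from the problem statement."""
    f = u * x + v * h_prev       # forget gate: f_t = u x_t + v h_{t-1}
    h = a * x + b * f * h_prev   # hidden state: h_t = a x_t + b f_t h_{t-1}
    y = w * h                    # output: y_t = w h_t
    return f, h, y

def loss(y_hat, y):
    """Squared error: l_t = (1/2)(y_hat_t - y_t)^2."""
    return 0.5 * (y_hat - y) ** 2
```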

Q1

Given the data point (x_1 = 1, y_1 = 2), and assuming that h_0 = 1 and u = v = a = b = w = 1, what is the gradient of the loss function l_1, evaluated at these assumed values, with respect to u? Please round your answer to one decimal place.
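
By the chain rule, ∂l_1/∂u = (∂l_1/∂y_1)(∂y_1/∂h_1)(∂h_1/∂f_1)(∂f_1/∂u), and a finite-difference probe is a quick way to check the hand-derived value. A sketch, assuming \hat{y}_1 denotes the target 2 and y_1 = w h_1 the network output (the opposite reading gives the same gradient here, since the error is squared):

```python
def l1(u, x1=1.0, y_target=2.0, h0=1.0, v=1.0, a=1.0, b=1.0, w=1.0):
    """Loss l_1 as a function of u, all other values fixed as in Q1."""
    f1 = u * x1 + v * h0          # f_1 = u x_1 + v h_0
    h1 = a * x1 + b * f1 * h0     # h_1 = a x_1 + b f_1 h_0
    y1 = w * h1                   # network output y_1 = w h_1
    return 0.5 * (y_target - y1) ** 2

eps = 1e-6
grad_u = (l1(1.0 + eps) - l1(1.0 - eps)) / (2 * eps)  # central difference
print(round(grad_u, 1))   # numerical estimate of dl_1/du at u = 1
```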

Q2

Assume that we want the network to "forget" everything in the past states. How should we set the parameters (select all that apply)?

u = 0 only
v = 0 only
u = v = 0 only
a = 0 only
b = 0 only
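
"Forgetting" the past means h_t must not depend on h_{t-1}, and in this model that dependence enters only through the term b f_t h_{t-1}. A numeric probe of each option (a check I added, not part of the original question): vary h_{t-1} and see whether h_t changes.

```python
def h_t(x, h_prev, u, v, a, b):
    """Hidden state h_t = a x_t + b f_t h_{t-1}, with f_t = u x_t + v h_{t-1}."""
    f = u * x + v * h_prev
    return a * x + b * f * h_prev

base = dict(u=1.0, v=1.0, a=1.0, b=1.0)
options = [
    ("u=0 only",  {**base, "u": 0.0}),
    ("v=0 only",  {**base, "v": 0.0}),
    ("u=v=0",     {**base, "u": 0.0, "v": 0.0}),
    ("a=0 only",  {**base, "a": 0.0}),
    ("b=0 only",  {**base, "b": 0.0}),
]
for name, p in options:
    # If h_t is unchanged when h_{t-1} varies, the past has been forgotten.
    depends = h_t(1.0, 0.0, **p) != h_t(1.0, 5.0, **p)
    print(f"{name}: h_t depends on h_(t-1)? {depends}")
```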

Q3

Let's train this simplified GRU network on a large dataset where each data point has T = 1000 timesteps. How many matrix multiplications does the gradient go through from l_T to x_0? Consider the most efficient implementation.

\log T
T
T^2
\sqrt{T}
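
In backpropagation through time, the gradient flowing from l_T back to x_0 is multiplied by one recurrent Jacobian ∂h_t/∂h_{t-1} per timestep, and because each product depends on the previous one, the chain cannot be shortened by reordering. A toy sketch (the state width d and the 0.99 Jacobians are arbitrary stand-ins, not from the problem):

```python
import numpy as np

T, d = 1000, 4                              # T timesteps; d is a toy state width
J = [0.99 * np.eye(d) for _ in range(T)]    # stand-ins for dh_t/dh_{t-1}

grad = np.ones(d)            # gradient at the end of the sequence, dl_T/dh_T
count = 0
for t in reversed(range(T)): # one matrix multiplication per timestep
    grad = J[t].T @ grad
    count += 1
print(count)                 # number of multiplications in the chain
```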
