Question: Given a simplified GRU with input $x$, output $y$, and $t$ indicating time. The forget gate at time $t$ is calculated as:

$f_t = u x_t + v h_{t-1}$

This gate corresponds to the update and reset gates in the fully gated version. The hidden state at time $t$ is:

$h_t = a x_t + b f_t h_{t-1}$

And the output is:

$y_t = w h_t$

The learnable parameters are $u$, $v$, $a$, $b$, and $w$. Train this GRU with the squared error loss function:

$l_t = \frac{1}{2}(\hat{y}_t - y_t)^2$

Q1

Given the data point $(x_1 = 1, y_1 = 2)$, and assuming that $h_0 = 1$ and $u = v = a = b = w = 1$, what is the gradient of the loss function $l_1$, evaluated at these assumed values, with respect to $u$? Please round your answer to one decimal place.
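A quick way to check Q1 is to run the forward and backward passes explicitly. Below is a minimal Python sketch (not part of the original question), assuming the target label is $\hat{y}_1 = 2$ and the network output is $y_1 = w h_1$:

```python
# Forward pass and chain-rule gradient dl_1/du at the given values.
x1, y_target = 1.0, 2.0          # data point (x_1 = 1, y_1 = 2)
h0 = 1.0
u = v = a = b = w = 1.0

f1 = u * x1 + v * h0             # forget gate: f_1 = 2
h1 = a * x1 + b * f1 * h0        # hidden state: h_1 = 3
y1 = w * h1                      # output: y_1 = 3
l1 = 0.5 * (y_target - y1) ** 2  # loss: l_1 = 0.5

# Chain rule: dl/du = (y_1 - y_target) * w * (b * h_0) * x_1
dl_du = (y1 - y_target) * w * (b * h0) * x1
print(round(dl_du, 1))           # -> 1.0
```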
Q2

Assume that we want the network to "forget" everything in the past states. How should we set the parameters (select all that apply)?

$u = 0$ only
$v = 0$ only
$u = v = 0$ only
$a = 0$ only
$b = 0$ only
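One way to reason about Q2 is to test, for each candidate setting, whether $h_1$ still depends on the previous state $h_0$. A small empirical probe (the specific $h_0$ values below are illustrative assumptions):

```python
# For each candidate setting, h_1 "forgets" the past iff it is unchanged
# when h_0 changes.
def h1(x1, h0, u, v, a, b):
    f1 = u * x1 + v * h0          # forget gate
    return a * x1 + b * f1 * h0   # hidden state

candidates = {
    "u=0 only":   dict(u=0.0, v=1.0, a=1.0, b=1.0),
    "v=0 only":   dict(u=1.0, v=0.0, a=1.0, b=1.0),
    "u=v=0 only": dict(u=0.0, v=0.0, a=1.0, b=1.0),
    "a=0 only":   dict(u=1.0, v=1.0, a=0.0, b=1.0),
    "b=0 only":   dict(u=1.0, v=1.0, a=1.0, b=0.0),
}

for name, p in candidates.items():
    forgets = h1(1.0, h0=1.0, **p) == h1(1.0, h0=5.0, **p)
    print(f"{name:12s} forgets the past: {forgets}")
```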
Q3

Let's train this simplified GRU network on a large dataset where each data point has $T = 1000$ timesteps. How many matrix multiplications does the gradient go through from $l_T$ to $x_0$? Consider the most efficient implementation.

$\log T$
$T$
$T^2$
$\sqrt{T}$
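For intuition on Q3: reverse-mode backpropagation through time propagates the gradient through the recurrence once per timestep, multiplying by one local Jacobian $\partial h_t / \partial h_{t-1}$ at each step. A scalar sketch (the parameter values and toy inputs here are assumptions for illustration, not from the source):

```python
# Backward chain from l_T toward the first timestep: one local-Jacobian
# multiplication per step, i.e. the chain length is linear in T.
T = 1000
u, v, a, b, w = 1.0, 1.0, 1.0, 0.5, 1.0   # b < 1 keeps this toy run stable
x = [0.1] * (T + 1)                        # toy inputs x_0 .. x_T

h, f = [1.0], [0.0]                        # h_0 = 1; f[0] is a placeholder
for t in range(1, T + 1):
    f.append(u * x[t] + v * h[t - 1])
    h.append(a * x[t] + b * f[t] * h[t - 1])

grad = (w * h[T] - 2.0) * w                # dl_T/dh_T, taking 2 as the label
steps = 0
for t in range(T, 0, -1):
    # dh_t/dh_{t-1} = b*f_t + b*v*h_{t-1}  (f_t also depends on h_{t-1})
    grad *= b * f[t] + b * v * h[t - 1]
    steps += 1
print(steps)                               # -> 1000, i.e. exactly T steps
```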