Question: Problem 4: Why output projection in MHA? Consider the standard multi-head self-attention (MHA) layer defined by

$$\mathrm{MHA}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)\, W^O, \tag{1}$$

where

$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{(X W_i^Q)(X W_i^K)^\top}{\sqrt{d_k}}\right) X W_i^V,$$

with $W_i^Q, W_i^K \in \mathbb{R}^{d \times d_k}$, $W_i^V \in \mathbb{R}^{d \times d_v}$, and $W^O \in \mathbb{R}^{H d_v \times d}$. Of course, it is often the case that $d_k = d_v = d/H$. Let us call this model MHA.

Next, consider a variant that we call MHA$'$, which has no output projection and instead sums its heads:

$$\mathrm{MHA}'(X) = \sum_{i=1}^{H} \mathrm{softmax}\!\left(\frac{(X W_i^Q)(X W_i^K)^\top}{\sqrt{d_k}}\right) X \widetilde{W}_i^V,$$

where $\widetilde{W}_i^V \in \mathbb{R}^{d \times d}$.
(a) Given an MHA model, decompose the rows of $W^O$ as

$$W^O = \begin{bmatrix} W_1^O \\ \vdots \\ W_H^O \end{bmatrix}$$

such that $W_i^O \in \mathbb{R}^{d_v \times d}$ for $i = 1, \ldots, H$. Show that if we set the parameters of an MHA$'$ model as $\widetilde{W}_i^V = W_i^V W_i^O$ for $i = 1, \ldots, H$, and keep all other parameters the same, then the MHA and MHA$'$ models are equivalent, i.e., $\mathrm{MHA}(X) = \mathrm{MHA}'(X)$ for all inputs $X$.
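As a sanity check on the identity in part (a), here is a small NumPy sketch (toy dimensions and variable names of my own choosing, not from the problem): folding each row-block $W_i^O$ of the output projection into the corresponding value matrix and summing the heads reproduces the concatenate-then-project output.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, H = 5, 16, 4            # toy sequence length, model dim, head count
dk = dv = d // H

def softmax(A):
    # Numerically stable row-wise softmax.
    A = A - A.max(axis=-1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=-1, keepdims=True)

X = rng.standard_normal((n, d))
WQ = [rng.standard_normal((d, dk)) for _ in range(H)]
WK = [rng.standard_normal((d, dk)) for _ in range(H)]
WV = [rng.standard_normal((d, dv)) for _ in range(H)]
WO = rng.standard_normal((H * dv, d))

# MHA: concatenate the heads, then apply the output projection W^O.
heads = [softmax((X @ WQ[i]) @ (X @ WK[i]).T / np.sqrt(dk)) @ X @ WV[i]
         for i in range(H)]
mha = np.concatenate(heads, axis=1) @ WO

# MHA': no output projection; each head uses Wtilde_i^V = W_i^V W_i^O,
# where W_i^O is the i-th (d_v x d) row-block of W^O, and heads are summed.
WO_blocks = [WO[i * dv:(i + 1) * dv] for i in range(H)]
mha_prime = sum(
    softmax((X @ WQ[i]) @ (X @ WK[i]).T / np.sqrt(dk)) @ X @ (WV[i] @ WO_blocks[i])
    for i in range(H)
)

print(np.allclose(mha, mha_prime))  # prints True: the two models agree
```

The key step is that block-partitioned matrix multiplication gives $\mathrm{Concat}(\mathrm{head}_1,\ldots,\mathrm{head}_H)\, W^O = \sum_i \mathrm{head}_i\, W_i^O$, and $W_i^O$ can be absorbed into $W_i^V$ because it multiplies from the right.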
(b) How many trainable parameters do MHA and MHA$'$ have?
(c) If $d_k = d_v = d/H$, what is the difference in the number of trainable parameters?
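For parts (b) and (c), the counts can be tabulated directly; the concrete sizes $d = 512$, $H = 8$ below are illustrative assumptions, not values from the problem statement.

```python
# Parameter counts for the two models, assuming d_k = d_v = d / H
# (the common choice mentioned in the problem setup).
def mha_params(d, H):
    dk = dv = d // H
    # H heads of (W^Q, W^K, W^V), plus the output projection W^O in R^{H dv x d}.
    return H * (d * dk + d * dk + d * dv) + H * dv * d

def mha_prime_params(d, H):
    dk = d // H
    # H heads of (W^Q, W^K, Wtilde^V), with each Wtilde_i^V in R^{d x d}.
    return H * (d * dk + d * dk + d * d)

d, H = 512, 8  # illustrative sizes (an assumption, not from the problem)
p1, p2 = mha_params(d, H), mha_prime_params(d, H)
print(p1, p2, p2 - p1)
assert p2 - p1 == d * d * (H - 2)  # closed form of the difference
```

Algebraically, the difference is $H d^2 - 2 H d\, d_v = d^2 (H - 2)$ when $d_v = d/H$: for $H > 2$ the output projection strictly reduces the parameter count, which is one answer to "why output projection?".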
