Problem 4: Why output projection on MHA? Consider the standard multi-head self-attention
(MHA) layer defined by
$$
\mathrm{MHA}_1(Q,K,V) \;=\; \mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_H)\,W^O,
\qquad
\mathrm{head}_h \;=\; \mathrm{softmax}\!\left(\frac{QW_h^Q\,(KW_h^K)^\top}{\sqrt{d_{\mathrm{attn}}}}\right) V\,W_h^V,
$$
where
$$
W^O \in \mathbb{R}^{Hd_{\mathrm{head}} \times d_{\mathrm{out}}},
\qquad
W_h^Q \in \mathbb{R}^{d_Q \times d_{\mathrm{attn}}},\quad
W_h^K \in \mathbb{R}^{d_K \times d_{\mathrm{attn}}},\quad
W_h^V \in \mathbb{R}^{d_V \times d_{\mathrm{head}}},
$$
$$
Q \in \mathbb{R}^{L \times d_Q},\qquad
K \in \mathbb{R}^{L \times d_K},\qquad
V \in \mathbb{R}^{L \times d_V}.
$$
(Of course, it is often the case that $Q = K = V = x \in \mathbb{R}^{L \times d}$.) Let us call this model MHA1.
Next, consider a variant that we call MHA2.
$$
\mathrm{MHA}_2(Q,K,V) \;=\; \sum_{h=1}^{H} \mathrm{head}_h,
\qquad
\mathrm{head}_h \;=\; \mathrm{softmax}\!\left(\frac{QW_h^Q\,(KW_h^K)^\top}{\sqrt{d_{\mathrm{attn}}}}\right) V\,W_h^V,
$$
where
$$
W_h^Q \in \mathbb{R}^{d_Q \times d_{\mathrm{attn}}},\quad
W_h^K \in \mathbb{R}^{d_K \times d_{\mathrm{attn}}},\quad
W_h^V \in \mathbb{R}^{d_V \times d_{\mathrm{out}}},
$$
$$
Q \in \mathbb{R}^{L \times d_Q},\qquad
K \in \mathbb{R}^{L \times d_K},\qquad
V \in \mathbb{R}^{L \times d_V}.
$$
(a) Given an MHA1 model, decompose the rows of $W^O$ into per-head blocks as
$$
W^O = \begin{bmatrix} W_1^O \\ W_2^O \\ \vdots \\ W_H^O \end{bmatrix} \in \mathbb{R}^{Hd_{\mathrm{head}} \times d_{\mathrm{out}}},
$$
such that $W_1^O, W_2^O, \dots, W_H^O \in \mathbb{R}^{d_{\mathrm{head}} \times d_{\mathrm{out}}}$. Show that if we set the parameters of an MHA2 model as $W_h^V \leftarrow W_h^V W_h^O$ for $h = 1, \dots, H$ and keep all other parameters the same, then the MHA1 and MHA2 models are equivalent, i.e., $\mathrm{MHA}_1(Q,K,V) = \mathrm{MHA}_2(Q,K,V)$ for all inputs $Q, K, V$. (A numerical check of this claim is sketched after part (c).)
(b) How many trainable parameters do MHA1 and MHA2 have?
(c) If $d_V = d_{\mathrm{out}} = 512$ and $d_{\mathrm{head}} = 64$, what is the difference in the number of trainable parameters?
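
Below is a minimal NumPy sketch of the equivalence claimed in part (a). All concrete sizes (`L`, `H`, `d_Q`, `d_K`, `d_V`, `d_attn`, `d_head`, `d_out`) and the random parameters are illustrative assumptions, not values from the problem; the `heads` helper simply implements standard scaled dot-product attention as written above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes, chosen only for this check.
L, H = 5, 4
d_Q, d_K, d_V = 16, 16, 16
d_attn, d_head, d_out = 8, 8, 12

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def heads(Q, K, V, Wq, Wk, Wv):
    # head_h = softmax(Q W_h^Q (K W_h^K)^T / sqrt(d_attn)) V W_h^V
    outs = []
    for q, k, v in zip(Wq, Wk, Wv):
        scores = (Q @ q) @ (K @ k).T / np.sqrt(d_attn)
        outs.append(softmax(scores) @ (V @ v))
    return outs

# Random MHA1 parameters.
Wq = [rng.standard_normal((d_Q, d_attn)) for _ in range(H)]
Wk = [rng.standard_normal((d_K, d_attn)) for _ in range(H)]
Wv = [rng.standard_normal((d_V, d_head)) for _ in range(H)]
Wo = rng.standard_normal((H * d_head, d_out))

Q = rng.standard_normal((L, d_Q))
K = rng.standard_normal((L, d_K))
V = rng.standard_normal((L, d_V))

# MHA1: concatenate heads along the feature axis, then project with W^O.
mha1 = np.concatenate(heads(Q, K, V, Wq, Wk, Wv), axis=-1) @ Wo

# MHA2 with W_h^V <- W_h^V W_h^O, where W_h^O is the h-th d_head-row block of W^O.
Wo_blocks = [Wo[h * d_head:(h + 1) * d_head] for h in range(H)]
Wv2 = [Wv[h] @ Wo_blocks[h] for h in range(H)]
mha2 = sum(heads(Q, K, V, Wq, Wk, Wv2))

print(np.allclose(mha1, mha2))  # expected: True
```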
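
For parts (b) and (c), the counts follow from the dimension specifications above. The sketch below counts weight matrices only (it assumes no bias terms, which the problem does not mention) and leaves the number of heads $H$ symbolic.

```python
# Part (b): trainable-parameter counts implied by the stated shapes (weights only).
def n_params_mha1(H, d_Q, d_K, d_V, d_attn, d_head, d_out):
    # per head: W_h^Q, W_h^K, W_h^V; plus the output projection W^O of shape (H*d_head, d_out)
    return H * (d_Q * d_attn + d_K * d_attn + d_V * d_head) + H * d_head * d_out

def n_params_mha2(H, d_Q, d_K, d_V, d_attn, d_out):
    # per head: W_h^Q, W_h^K, W_h^V mapping straight to d_out; no W^O
    return H * (d_Q * d_attn + d_K * d_attn + d_V * d_out)

# Part (c): with d_V = d_out = 512 and d_head = 64 the Q/K projections are identical
# in both models, so the difference per head reduces to
# d_V*d_out - (d_V*d_head + d_head*d_out).
d_V, d_out, d_head = 512, 512, 64
print(d_V * d_out - (d_V * d_head + d_head * d_out))  # 196608 fewer parameters per head in MHA1
```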