Question: Problem 4: Why output projection in MHA? Consider the standard multi-head self-attention (MHA) layer defined by

$$\mathrm{MHA}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)\, W^O, \tag{1}$$

where

$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{(X W_i^Q)(X W_i^K)^\top}{\sqrt{d_k}}\right) X W_i^V,$$

with $W_i^Q, W_i^K \in \mathbb{R}^{d \times d_k}$, $W_i^V \in \mathbb{R}^{d \times d_v}$, and $W^O \in \mathbb{R}^{H d_v \times d}$. Of course, it is often the case that $d_k = d_v = d/H$. Let us call this model MHA.

Next, consider a variant that we call MHA$'$, which has no output projection and instead sums its heads:

$$\mathrm{MHA}'(X) = \sum_{i=1}^{H} \mathrm{softmax}\!\left(\frac{(X W_i^Q)(X W_i^K)^\top}{\sqrt{d_k}}\right) X \widetilde{W}_i^V,$$

where $\widetilde{W}_i^V \in \mathbb{R}^{d \times d}$.
(a) Given an MHA model, decompose the rows of $W^O$ as

$$W^O = \begin{bmatrix} W_1^O \\ \vdots \\ W_H^O \end{bmatrix}$$

such that $W_i^O \in \mathbb{R}^{d_v \times d}$ for $i = 1, \ldots, H$. Show that if we set the parameters of an MHA$'$ model as $\widetilde{W}_i^V = W_i^V W_i^O$ for $i = 1, \ldots, H$, and keep all other parameters the same, then the MHA and MHA$'$ models are equivalent, i.e., $\mathrm{MHA}(X) = \mathrm{MHA}'(X)$ for all inputs $X$.
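As a sanity check on the identity in part (a), here is a small NumPy sketch (toy dimensions and variable names of my own choosing, not from the problem): folding each row-block $W_i^O$ of the output projection into the corresponding value matrix and summing the heads reproduces the concatenate-then-project output.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, H = 5, 16, 4            # toy sequence length, model dim, head count
dk = dv = d // H

def softmax(A):
    # Numerically stable row-wise softmax.
    A = A - A.max(axis=-1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=-1, keepdims=True)

X = rng.standard_normal((n, d))
WQ = [rng.standard_normal((d, dk)) for _ in range(H)]
WK = [rng.standard_normal((d, dk)) for _ in range(H)]
WV = [rng.standard_normal((d, dv)) for _ in range(H)]
WO = rng.standard_normal((H * dv, d))

# MHA: concatenate the heads, then apply the output projection W^O.
heads = [softmax((X @ WQ[i]) @ (X @ WK[i]).T / np.sqrt(dk)) @ X @ WV[i]
         for i in range(H)]
mha = np.concatenate(heads, axis=1) @ WO

# MHA': no output projection; each head uses Wtilde_i^V = W_i^V W_i^O,
# where W_i^O is the i-th (d_v x d) row-block of W^O, and heads are summed.
WO_blocks = [WO[i * dv:(i + 1) * dv] for i in range(H)]
mha_prime = sum(
    softmax((X @ WQ[i]) @ (X @ WK[i]).T / np.sqrt(dk)) @ X @ (WV[i] @ WO_blocks[i])
    for i in range(H)
)

print(np.allclose(mha, mha_prime))  # prints True: the two models agree
```

The key step is that block-partitioned matrix multiplication gives $\mathrm{Concat}(\mathrm{head}_1,\ldots,\mathrm{head}_H)\, W^O = \sum_i \mathrm{head}_i\, W_i^O$, and $W_i^O$ can be absorbed into $W_i^V$ because it multiplies from the right.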
(b) How many trainable parameters do MHA and MHA$'$ have?
(c) If $d_k = d_v = d/H$, what is the difference in the number of trainable parameters?
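For parts (b) and (c), the counts can be tabulated directly; the concrete sizes $d = 512$, $H = 8$ below are illustrative assumptions, not values from the problem statement.

```python
# Parameter counts for the two models, assuming d_k = d_v = d / H
# (the common choice mentioned in the problem setup).
def mha_params(d, H):
    dk = dv = d // H
    # H heads of (W^Q, W^K, W^V), plus the output projection W^O in R^{H dv x d}.
    return H * (d * dk + d * dk + d * dv) + H * dv * d

def mha_prime_params(d, H):
    dk = d // H
    # H heads of (W^Q, W^K, Wtilde^V), with each Wtilde_i^V in R^{d x d}.
    return H * (d * dk + d * dk + d * d)

d, H = 512, 8  # illustrative sizes (an assumption, not from the problem)
p1, p2 = mha_params(d, H), mha_prime_params(d, H)
print(p1, p2, p2 - p1)
assert p2 - p1 == d * d * (H - 2)  # closed form of the difference
```

Algebraically, the difference is $H d^2 - 2 H d\, d_v = d^2 (H - 2)$ when $d_v = d/H$: for $H > 2$ the output projection strictly reduces the parameter count, which is one answer to "why output projection?".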
