Question: We define multi-head self-attention as below:

Y(x) = concat[H_1, …, H_H] W^(o)
H_h = Softmax(Q_h K_h^T / √D_k) V_h
Q_h = x W_h^(q),  K_h = x W_h^(k),  V_h = x W_h^(v)

This formulation contains some redundancy: for every head, the value projection W_h^(v) is multiplied consecutively with the output matrix W^(o). Removing this redundancy lets us write multi-head self-attention as a sum of the contributions of the individual heads. Prove that multi-head self-attention can be written as

Y(x) = Σ_{h=1}^{H} Softmax(Q_h K_h^T / √D_k) x W^(h)

(Hint: W^(h) equals W_h^(v) W_h^(o), where W_h^(o) is the h-th block when the matrix W^(o) is divided horizontally into as many blocks as there are heads.)
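A sketch of the derivation, following the hint: split W^(o) row-wise into per-head blocks W_h^(o), so that the concatenation times W^(o) becomes a sum over heads, then substitute V_h = x W_h^(v):

```latex
\begin{aligned}
Y(x) &= \mathrm{concat}[H_1,\dots,H_H]\, W^{(o)}
      = \sum_{h=1}^{H} H_h\, W^{(o)}_h \\
     &= \sum_{h=1}^{H} \mathrm{Softmax}\!\left(\frac{Q_h K_h^{\top}}{\sqrt{D_k}}\right) V_h\, W^{(o)}_h \\
     &= \sum_{h=1}^{H} \mathrm{Softmax}\!\left(\frac{Q_h K_h^{\top}}{\sqrt{D_k}}\right) x\, W^{(v)}_h W^{(o)}_h \\
     &= \sum_{h=1}^{H} \mathrm{Softmax}\!\left(\frac{Q_h K_h^{\top}}{\sqrt{D_k}}\right) x\, W^{(h)},
     \qquad W^{(h)} := W^{(v)}_h W^{(o)}_h .
\end{aligned}
```

The first equality is the standard block-matrix identity: multiplying a horizontal concatenation by a matrix partitioned into matching horizontal (row) blocks gives the sum of the blockwise products.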
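The identity can also be checked numerically. Below is a minimal NumPy sketch (all sizes and weight matrices are illustrative, randomly generated assumptions, not taken from the question) that computes multi-head self-attention both ways, via concatenation with W^(o) and via the per-head sum with W^(h) = W_h^(v) W_h^(o), and confirms they agree:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: sequence length T, model dim D, H heads of dim Dk each.
T, D, H, Dk = 5, 8, 2, 4

x = rng.standard_normal((T, D))
Wq = [rng.standard_normal((D, Dk)) for _ in range(H)]
Wk = [rng.standard_normal((D, Dk)) for _ in range(H)]
Wv = [rng.standard_normal((D, Dk)) for _ in range(H)]
Wo = rng.standard_normal((H * Dk, D))  # output projection W^(o)

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Standard form: concatenate the heads, then apply W^(o).
heads = []
for h in range(H):
    Q, K, V = x @ Wq[h], x @ Wk[h], x @ Wv[h]
    heads.append(softmax(Q @ K.T / np.sqrt(Dk)) @ V)
Y_concat = np.concatenate(heads, axis=1) @ Wo

# Summed form: W^(h) = W^(v)_h W^(o)_h, with W^(o) split horizontally per head.
Y_sum = np.zeros((T, D))
for h in range(H):
    Q, K = x @ Wq[h], x @ Wk[h]
    Wh = Wv[h] @ Wo[h * Dk:(h + 1) * Dk]  # per-head combined matrix W^(h)
    Y_sum += softmax(Q @ K.T / np.sqrt(Dk)) @ x @ Wh

print(np.allclose(Y_concat, Y_sum))  # → True
```

Both loops use the same attention weights per head; only the order of the projections differs, which is exactly the associativity the proof relies on.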