Question:

BERT Large has 24 layers, 1024 dimensions per wordpiece token, and 16 self-attention heads. The input into the first self-attention layer of BERT Large is Sequence Length x 1024 (i.e., we use it without batching). The Sequence Length is 5 for our sequence. Calculate the number of entries (scalars) inside a single attention matrix for a single attention head in BERT Large for this sequence.
Your answer should be an integer.

Step by Step Solution

There are 3 steps involved:

Step 1: For a single self-attention head, the attention matrix is softmax(Q K^T / sqrt(d_k)), which compares every query position against every key position. Its shape is therefore Sequence Length x Sequence Length; the 1024-dim hidden size and the 16 heads only affect the Q/K/V projections, not the shape of this matrix.

Step 2: With Sequence Length = 5, the attention matrix for one head is 5 x 5.

Step 3: The number of entries (scalars) is 5 * 5 = 25.

Answer: 25
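As a quick sanity check, here is a minimal NumPy sketch (with made-up random Q and K values, and the per-head dimension 1024 / 16 = 64 implied by the question) that builds one head's attention matrix for a length-5 sequence and counts its entries:

import numpy as np

seq_len = 5                   # sequence length from the question
d_model = 1024                # hidden size of BERT Large
n_heads = 16                  # self-attention heads in BERT Large
d_head = d_model // n_heads   # 64 dims per head

rng = np.random.default_rng(0)
Q = rng.standard_normal((seq_len, d_head))   # queries for one head (random placeholder values)
K = rng.standard_normal((seq_len, d_head))   # keys for one head (random placeholder values)

scores = Q @ K.T / np.sqrt(d_head)           # raw attention scores, shape (5, 5)
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the key dimension

print(weights.shape)   # (5, 5)
print(weights.size)    # 25 entries in the attention matrix

Running it prints (5, 5) and 25, matching the answer above.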
