Question: The BERT Large model has 24 layers, a 1024-dimensional representation per wordpiece token, and 16 self-attention heads. The input into the first self-attention layer of BERT Large is Sequence Length x 1024 (i.e., we use it without batching). The Sequence Length is ___ for our sequence. Calculate the number of entries (scalars) inside a single attention matrix for a single attention head in BERT Large for this sequence.
Your answer should be an integer.
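A minimal sketch of the reasoning: for one head, attention is computed as softmax(QK^T / sqrt(d_k)) over the sequence, so a single attention matrix has shape (Sequence Length, Sequence Length) and the number of entries is Sequence Length squared. The hidden size (1024), number of heads (16), and number of layers (24) do not change this count. The sequence length value is missing from the question as given, so the example below uses a hypothetical placeholder value purely for illustration.

```python
# Sketch: entry count of a single-head attention matrix in BERT Large.
# The attention matrix for one head is (seq_len x seq_len), so the count
# is seq_len ** 2, independent of hidden size or number of heads.

def attention_matrix_entries(seq_len: int) -> int:
    """Number of scalars in one head's attention matrix for a sequence of length seq_len."""
    return seq_len * seq_len

if __name__ == "__main__":
    seq_len = 128  # hypothetical placeholder; the question's actual value is missing
    print(attention_matrix_entries(seq_len))  # 16384 for seq_len = 128
```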
