Question: Transformers models use a hybrid approach between word-level and character-level tokenization called subword tokenization: BPE (Byte-Pair Encoding)

Transformers models use a hybrid approach between word-level and character-level tokenization called subword
tokenization. BPE (Byte-Pair Encoding) is a subword-level tokenization approach introduced in Neural Machine
Translation of Rare Words with Subword Units (Sennrich et al., 2015). BPE relies on a pre-tokenizer that
splits the training data into words; pre-tokenization can be as simple as splitting on whitespace. Let us assume
that after pre-tokenization, the following set of words, including their frequencies, has been determined:
(old,10),(older,5),(oldest,8),(hug,8),(pug,4),(hugs,5)
We obtain a base vocabulary:
o,l,d,e,r,s,t,h,u,g,p
Splitting all words into symbols in the base vocabulary, we obtain:
(o,l,d,10),(o,l,d,e,r,5),(o,l,d,e,s,t,8),(h,u,g,8),(p,u,g,4),(h,u,g,s,5)
BPE then counts the frequency of each possible symbol pair and picks the symbol pair that occurs most frequently.
In the above example, "o" followed by "l" is present 10+5+8=23 times. Thus, the first merge rule the tokenizer
learns is to group all "o" symbols followed by an "l" symbol together. Next, "ol" is added to the vocabulary.
The set of words then becomes:
(ol,d,10),(ol,d,e,r,5),(ol,d,e,s,t,8),(h,u,g,8),(p,u,g,4),(h,u,g,s,5)
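As a quick sanity check, the pair counts above can be reproduced with a short snippet (a sketch in plain Python; `word_freqs` is taken directly from the problem statement):

```python
from collections import Counter

# Word frequencies after pre-tokenization (from the problem statement).
word_freqs = {"old": 10, "older": 5, "oldest": 8, "hug": 8, "pug": 4, "hugs": 5}

# Count every adjacent symbol pair, weighted by the word's frequency.
pairs = Counter()
for word, freq in word_freqs.items():
    for pair in zip(word, word[1:]):
        pairs[pair] += freq

print(pairs[("o", "l")])        # 10 + 5 + 8 = 23
print(max(pairs, key=pairs.get))  # ('o', 'l') — the first merge rule
```

Note that ("o", "l") and ("l", "d") both occur 23 times; `max` returns the pair seen first, which matches the merge order described above.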
This process runs iteratively. The vocabulary size, i.e. the base vocabulary size plus the number of merges,
is a hyperparameter to choose. The learned merge rules are then applied to new words (as long as those
new words do not include symbols that were not in the base vocabulary). A word containing a symbol not in
the base vocabulary is represented as "[unk]". Implement this BPE tokenizer, set the vocabulary size to 16,
and train the BPE tokenizer until the iterative process finishes. Use the trained tokenizer to tokenize the
words below: (15 marks)
{hold, oldest, older, pug, mug, huggingface}
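A minimal sketch of the requested trainer and tokenizer is given below, in plain Python. One assumption is made that the problem statement leaves open: ties between equally frequent pairs are broken by first-seen order, which reproduces the merge order shown in the worked example (with vocabulary size 16 and a base vocabulary of 11 symbols, exactly 5 merges are learned).

```python
from collections import Counter

def _merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train_bpe(word_freqs, vocab_size):
    """Learn BPE merge rules until base vocab + merges reaches vocab_size."""
    # Base vocabulary: symbols in first-seen order.
    base_vocab = []
    for word in word_freqs:
        for ch in word:
            if ch not in base_vocab:
                base_vocab.append(ch)
    corpus = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(vocab_size - len(base_vocab)):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Assumption: ties broken by first-seen order (max keeps the
        # first maximal key in insertion order).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        corpus = {tuple(_merge(s, best)): f for s, f in corpus.items()}
    return base_vocab, merges

def tokenize(word, base_vocab, merges):
    """Apply learned merges in order; unknown symbols make the word [unk]."""
    if any(ch not in base_vocab for ch in word):
        return ["[unk]"]
    symbols = list(word)
    for pair in merges:
        symbols = _merge(symbols, pair)
    return symbols

word_freqs = {"old": 10, "older": 5, "oldest": 8, "hug": 8, "pug": 4, "hugs": 5}
base_vocab, merges = train_bpe(word_freqs, vocab_size=16)
print(merges)  # [('o','l'), ('ol','d'), ('u','g'), ('old','e'), ('h','ug')]
for w in ["hold", "oldest", "older", "pug", "mug", "huggingface"]:
    print(w, "->", tokenize(w, base_vocab, merges))
```

Under these assumptions the trained tokenizer yields: hold → [h, old], oldest → [olde, s, t], older → [olde, r], pug → [p, ug], mug → [unk] (contains "m"), and huggingface → [unk] (contains symbols such as "i", "n", "f", "a", "c" outside the base vocabulary).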
