Question: Machine learning is increasingly popular, and one major reason is its ability to process natural language. What people often do not know is that Large Language Models (LLMs) do not actually understand text. Instead, they work on a tokenized version of the text, and to produce it they require a tokenizer.
A tokenizer converts text into a sequence of tokens. One simple tokenization method consists of converting words into numbers. First, the method identifies all unique words in a corpus, and then it assigns a different number to each of these words.
Your task is to implement this tokenizer in C. Your program will receive a file with some text as input. It must find all unique words and assign them numbers from 1 to N in alphabetical order, where N is the number of unique words in the document. The first word in alphabetical order receives the number 1, the second word in alphabetical order receives the number 2, and so on. Finally, you must convert the document into a sequence of tokens.
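The numbering step can be sketched as a sort followed by in-place deduplication, with a binary search for lookups. This is a minimal sketch, not the expected solution: the names `build_vocab`, `token_id`, and `cmp_words` are hypothetical, `MAX_WORD_LEN` is an assumed bound that the task does not state, and numbering is assumed to start at 1.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_WORD_LEN 64 /* assumed bound on word length; not given in the task */

/* Comparator shared by qsort and bsearch for fixed-size word buffers. */
int cmp_words(const void *a, const void *b) {
    return strcmp((const char *)a, (const char *)b);
}

/* Sorts words[0..n) alphabetically and removes duplicates in place.
   Returns N, the number of unique words. */
size_t build_vocab(char words[][MAX_WORD_LEN], size_t n) {
    if (n == 0) return 0;
    qsort(words, n, sizeof words[0], cmp_words);
    size_t unique = 1;
    for (size_t i = 1; i < n; i++) {
        if (strcmp(words[i], words[unique - 1]) != 0) {
            if (i != unique) strcpy(words[unique], words[i]);
            unique++;
        }
    }
    return unique;
}

/* Returns the token number of word (assumed to start at 1),
   or 0 if the word is not in the vocabulary. */
size_t token_id(char vocab[][MAX_WORD_LEN], size_t n, const char *word) {
    assert(strlen(word) < MAX_WORD_LEN);
    char (*hit)[MAX_WORD_LEN] = bsearch(word, vocab, n, sizeof vocab[0], cmp_words);
    return hit ? (size_t)(hit - vocab) + 1 : 0;
}
```

Because the vocabulary is sorted, a word's position in the array plus one is its token number, so no separate word-to-number table is needed.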
Input format
Your program will be executed with two command line arguments. The first argument specifies the name of the file that contains the text to be processed, and the second argument specifies the name of the file where you must write the expected answer following the format below. The input file is guaranteed to have at most characters, and all characters are lowercase letters, whitespace, points, or newlines.
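One way to consume such an input is to load the whole file into memory and then scan it token by token. The sketch below is illustrative only: `read_stream`, `next_token`, and the `TOK_*` constants are hypothetical names, and the scanner assumes that anything other than a lowercase letter or a point acts as a separator.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

enum token_kind { TOK_WORD, TOK_POINT, TOK_END };

/* Reads the rest of an open stream into a NUL-terminated heap buffer.
   Returns NULL on allocation failure; the caller frees the result. */
char *read_stream(FILE *in) {
    size_t cap = 1024, len = 0;
    char *buf = malloc(cap);
    if (!buf) return NULL;
    int c;
    while ((c = fgetc(in)) != EOF) {
        if (len + 1 == cap) { /* keep one byte free for the terminator */
            char *grown = realloc(buf, cap * 2);
            if (!grown) { free(buf); return NULL; }
            buf = grown;
            cap *= 2;
        }
        buf[len++] = (char)c;
    }
    buf[len] = '\0';
    return buf;
}

/* Scans from *cursor. On TOK_WORD the word is copied into word[0..cap).
   TOK_POINT marks the end of a sentence, TOK_END the end of the text.
   *cursor is advanced past whatever was consumed. */
enum token_kind next_token(const char **cursor, char *word, size_t cap) {
    assert(cap > 0);
    const char *p = *cursor;
    /* Skip separators: anything that is neither a letter nor a point. */
    while (*p != '\0' && *p != '.' && !(*p >= 'a' && *p <= 'z')) p++;
    if (*p == '\0') { *cursor = p; return TOK_END; }
    if (*p == '.') { *cursor = p + 1; return TOK_POINT; }
    size_t len = 0;
    while (*p >= 'a' && *p <= 'z') {
        if (len + 1 < cap) word[len++] = *p; /* overlong words are truncated */
        p++;
    }
    word[len] = '\0';
    *cursor = p;
    return TOK_WORD;
}
```

In a full program, `main` would open `argv[1]` with `fopen`, pass the stream to `read_stream`, and call `next_token` in a loop until it returns `TOK_END`.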
Output
In the first line of the output, print N. In the next N lines, print the unique words found in the input file in alphabetical order. Finally, for each sentence in the input file, print one line of numbers corresponding to the words, in the same order they appear in the sentence, separated by whitespace. Sentences are separated by points in the input file. There are no empty sentences, and all sentences end with a point.
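The output layout described above could be produced by a routine along these lines. This is a sketch under assumptions: `write_answer` and its parameters are hypothetical, `MAX_WORD_LEN` is an assumed bound, and the token numbers are taken to be already computed and stored in document order.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_WORD_LEN 64 /* assumed bound on word length; not given in the task */

/* Writes N, then the N vocabulary words (already sorted), then one line of
   token numbers per sentence. ids holds every token number in document order;
   sentence_len[s] is the number of words in sentence s. */
void write_answer(FILE *out, char vocab[][MAX_WORD_LEN], size_t n,
                  const size_t *ids, const size_t *sentence_len,
                  size_t sentences) {
    assert(out != NULL);
    fprintf(out, "%zu\n", n);
    for (size_t i = 0; i < n; i++)
        fprintf(out, "%s\n", vocab[i]);
    size_t next = 0; /* index of the next unwritten token in ids */
    for (size_t s = 0; s < sentences; s++) {
        for (size_t k = 0; k < sentence_len[s]; k++) {
            if (k > 0) fputc(' ', out); /* single space between numbers */
            fprintf(out, "%zu", ids[next++]);
        }
        fputc('\n', out);
    }
}
```

For the input `the cat saw the dog. the dog saw the cat.`, with numbering assumed to start at 1, the vocabulary is `cat dog saw the`, and the two sentence lines are `4 1 3 4 2` and `4 2 3 4 1`. In a full program, `out` would be the stream opened on `argv[2]`.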
