Question:
This section needs to be completed using Python 3.6+. You will also require the following packages:
• pandas
• numpy
• NLTK or SpaCy
• scikit-learn
• random
Q1. Text Generation using N-grams: Code [15]
1. Download the .txt file of the book "The Great Gatsby" from the Gutenberg Project. Write a program to read the book and preprocess it by first applying a sentence tokenizer and then a word tokenizer. Make sure to remove all punctuation and stopwords.
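A minimal preprocessing sketch with NLTK (the filename gatsby.txt and the helper name preprocess are assumptions; use whatever path you saved the Project Gutenberg file to):

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

nltk.download("punkt")      # tokenizer models
nltk.download("stopwords")  # English stopword list

def preprocess(path):
    """Read the book, apply sentence then word tokenization,
    and drop punctuation and stopwords."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    stop_words = set(stopwords.words("english"))
    sentences = []
    for sent in sent_tokenize(text):
        # Keep alphabetic tokens only (drops punctuation and numbers).
        tokens = [w for w in word_tokenize(sent)
                  if w.isalpha() and w.lower() not in stop_words]
        if tokens:
            sentences.append(tokens)
    return sentences

sentences = preprocess("gatsby.txt")  # assumed local filename
```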
2. Add a '<s>' token at the start and a '</s>' token at the end of each sentence, then generate a dictionary containing the bigrams and their frequencies. The sentence 'I love nlp' will become '<s> I love nlp </s>', so the bigrams for this sentence will be '<s> I', 'I love', 'love nlp', 'nlp </s>'.
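One way to build the padded bigram frequencies, reusing `sentences` from the sketch above (the helper name bigram_counts is an assumption; unigram counts are kept as well because step 3 needs them):

```python
from collections import Counter

def bigram_counts(sentences):
    """Pad each sentence with <s>/</s>, then count bigrams and unigrams."""
    bigrams, unigrams = Counter(), Counter()
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))  # adjacent word pairs
    return bigrams, unigrams

bigrams, unigrams = bigram_counts(sentences)
```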
3. With the following formula, calculate the conditional probability of each word given the previous word:
$$P(w_i \mid w_{i-1}) = \frac{\text{count}(w_{i-1}, w_i)}{\text{count}(w_{i-1})}$$
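The formula maps directly onto the two counters from step 2; a sketch (cond_prob is a hypothetical helper name):

```python
def cond_prob(word, prev, bigrams, unigrams):
    """P(word | prev) = count(prev, word) / count(prev)."""
    if unigrams[prev] == 0:
        return 0.0  # unseen context word
    return bigrams[(prev, word)] / unigrams[prev]
```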
4. Using 'He' as the first token, generate the next 5 words as follows (a sketch is given after this list):
• For word $w_{i-1}$, get the probability of all other words $w_i$ given the word $w_{i-1}$, and make a list of the first 10 words with the highest probability.
• Use the method random.choice on the generated list to get a random word with high probability.
• Continue the process until you generate the next 5 words or encounter a '</s>' token.
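A sketch of the generation loop (generate and top_k are assumed names; top_k=10 matches the "first 10 words" rule). Sorting candidates by bigram count gives the same ranking as sorting by probability, since the denominator count(w_{i-1}) is identical for every candidate:

```python
import random

def generate(seed, bigrams, n_words=5, top_k=10):
    """Sample each next word uniformly from the top-k most probable successors."""
    sequence, current = [seed], seed
    for _ in range(n_words):
        # All words ever seen immediately after `current`, with their counts.
        candidates = [(w, c) for (p, w), c in bigrams.items() if p == current]
        if not candidates:
            break
        candidates.sort(key=lambda pair: pair[1], reverse=True)
        current = random.choice([w for w, _ in candidates[:top_k]])
        if current == "</s>":  # stop at the end-of-sentence token
            break
        sequence.append(current)
    return sequence

generated = generate("He", bigrams)
print(" ".join(generated))
```

Note that 'He' only works as a seed if it survives preprocessing; NLTK's English stopword list contains 'he', so with a case-insensitive stopword filter you may need to exempt the seed word.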
5. With the perplexity metric as defined below, evaluate the performance of the model for the generated sequence obtained from the previous step.
$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}$$
To avoid underflow, use log space to calculate the perplexity metric.
$$\log PP(W) = -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_{i-1})$$
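A log-space sketch over the generated sequence (perplexity is an assumed name; every bigram in a generated sequence was seen in training, so each probability is nonzero):

```python
import math

def perplexity(sequence, bigrams, unigrams):
    """PP(W) = exp(-(1/N) * sum_i log P(w_i | w_{i-1}))."""
    pairs = list(zip(sequence, sequence[1:]))
    if not pairs:
        return float("inf")  # no bigrams to score
    log_sum = sum(math.log(bigrams[(prev, w)] / unigrams[prev])
                  for prev, w in pairs)
    return math.exp(-log_sum / len(pairs))

print(perplexity(generated, bigrams, unigrams))
```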