Question:
(REALLY NEED HELP CREATING THIS CODE IN FULL AND ITS COMPLETE ENTIRETY... ALL OF THE DETAILS ARE PROVIDED AND THE CODE SHOULD HAVE EACH PART FOR EACH QUESTION LABELED SEPARATELY... THE REQUIRED LINKS TO CREATE THE PATHWAYS ARE ALL LISTED AT THE END OF THE INSTRUCTIONS AND QUESTIONS... PROVIDE OUTPUT AS WELL... PLEASE HELP ME AND I PROMISE TO LEAVE GOOD FEEDBACK AND REVIEWS)
1 Introduction
In this assignment, you will learn to build a simple text generation model using N-grams. You will also implement a part-of-speech (POS) tagger using Hidden Markov Models and Conditional Random Fields. POS tagging marks each word with a grammatical tag; these tags are essential for understanding sentence structure, tell us about the likelihood of neighboring words, and make POS tagging an integral part of parsing.
- N-grams make use of the chain rule of probability, together with an approximation, to find the probability of a word occurring given the previous N−1 words. This lets us assign a probability to the next word and generate text sequentially.
- A Hidden Markov Model (HMM) is a generative sequence model that computes the probability of a sequence of hidden labels given a sequence of observable inputs. One of the many applications of HMMs is POS tagging: for a given sentence, the HMM learns the probability of a hidden grammatical tag sequence given the observed sequence of words. In this assignment, you will learn to implement the HMM using the Viterbi algorithm.
- A Conditional Random Field (CRF) is a discriminative sequence model for prediction tasks where contextual information, or the state of the neighbors, affects the current prediction. HMMs use the Markov assumption, imposing a dependency only on the previous element; in contrast, a CRF allows the modeling of arbitrary feature functions.
2 Instructions
Code
This section needs to be completed using Python 3.6+. You will also require the following packages:
- pandas
- numpy
- NLTK or SpaCy
- scikit-learn
- random
If you want to use an external package for any reason, you are required to get approval from the course staff prior to submission.
3 Questions
Q1. Text Generation using N-grams: Code
In this question, you will build a language model using bigrams, which approximates the probability of a word $w_n$ given all the previous words, $P(w_n \mid w_{1:n-1})$, by using only the conditional probability of the preceding word, $P(w_n \mid w_{n-1})$. Then for a given text, using the chain rule,
$$p(d) = p(w_{1:N}) = \prod_{k=1}^{N} p(w_k \mid w_{k-1})$$
You will then evaluate the model using the metric 'Perplexity'. Complete the following tasks:
- Download the .txt file of the book "The Great Gatsby" from the Gutenberg Project (link 1 below). Write a program to read the book and preprocess it by first applying the sentence tokenizer (link 2) and then the word tokenizer (link 3). Make sure to remove all punctuation and stopwords. A sketch of this preprocessing pipeline appears after this list.
- Add a <s> token at the start and a </s> token at the end of each sentence, and then generate a dictionary containing the bigrams and their frequencies. The sentence 'I love nlp' will become '<s> I love nlp </s>', so the bigrams for this sentence will be '<s> I', 'I love', 'love nlp', 'nlp </s>'.
- With the following formula, calculate the conditional probability of each word given the previous word:
$$p(w_i \mid w_{i-1}) = \frac{count(w_{i-1}, w_i)}{count(w_{i-1})}$$
- Using 'He' as the first token, generate the next 5 words as follows (see the generation sketch after this list):
• For word $w_{i-1}$, get the probability of every other word $w_i$ given the word $w_{i-1}$, and make a list of the 10 words with the highest probability.
• Use the method random.choice (link 4 below) on the generated list to get a random word with high probability.
• Continue the process until you have generated the next 5 words or you encounter a </s> token.
- With the perplexity metric as defined below, evaluate the performance of the model on the generated sequence obtained from the previous step.
$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{p(w_i \mid w_{i-1})}}$$
To avoid underflow, use log space to calculate the perplexity metric.
$$\log PP(W) = -\frac{1}{N} \sum_{i=1}^{N} \log p(w_i \mid w_{i-1})$$
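Q1 sketch, part 1: a minimal illustration of the preprocessing, bigram-counting, and conditional-probability steps, assuming the book has been saved locally as gatsby.txt (a hypothetical filename) and that NLTK's punkt and stopwords data can be downloaded. All function names are illustrative, not prescribed by the assignment. Note one judgment call: NLTK's stopword list contains 'he', so the filter below is kept case-sensitive; a lowercased filter would delete the capitalized 'He' that the generation step relies on.

```python
import string
from collections import defaultdict

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt")
nltk.download("stopwords")

STOP_WORDS = set(stopwords.words("english"))

def preprocess(path="gatsby.txt"):
    """Read the book, sentence-tokenize, word-tokenize, and drop punctuation/stopwords."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    sentences = []
    for sent in sent_tokenize(text):
        tokens = [w for w in word_tokenize(sent)
                  if w not in string.punctuation  # simple filter; misses multi-char tokens like "--"
                  and w not in STOP_WORDS]        # case-sensitive, so 'He' survives
        if tokens:
            # Mark sentence boundaries so the model can learn starts and ends.
            sentences.append(["<s>"] + tokens + ["</s>"])
    return sentences

def bigram_counts(sentences):
    """Build the bigram frequency dictionary plus context (previous-word) counts."""
    bigrams, contexts = defaultdict(int), defaultdict(int)
    for sent in sentences:
        for w1, w2 in zip(sent, sent[1:]):
            bigrams[(w1, w2)] += 1
            contexts[w1] += 1          # count(w_{i-1}), the MLE denominator
    return bigrams, contexts

def cond_prob(bigrams, contexts, w_prev, w):
    """MLE estimate p(w | w_prev) = count(w_prev, w) / count(w_prev)."""
    denom = contexts.get(w_prev, 0)
    return bigrams[(w_prev, w)] / denom if denom else 0.0
```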
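Q1 sketch, part 2: generation and perplexity, continuing from the functions above. The top-10 list plus random.choice follows the assignment's recipe; the perplexity is computed in log space as instructed, and math.exp recovers PP(W) from it.

```python
import math
import random

def generate(bigrams, contexts, start="He", max_words=5, top_k=10):
    """Generate up to max_words tokens after `start`, sampling from the
    top_k most probable successors at each step."""
    sequence = [start]
    for _ in range(max_words):
        prev = sequence[-1]
        # Rank every observed successor of `prev` by conditional probability.
        candidates = sorted(
            ((w2, cond_prob(bigrams, contexts, prev, w2))
             for (w1, w2) in bigrams if w1 == prev),
            key=lambda pair: pair[1], reverse=True)[:top_k]
        if not candidates:
            break
        word = random.choice([w for w, _ in candidates])
        if word == "</s>":              # stop at an end-of-sentence token
            break
        sequence.append(word)
    return sequence

def log_perplexity(sequence, bigrams, contexts):
    """log PP(W) = -(1/N) * sum_i log p(w_i | w_{i-1}), computed in log space."""
    pairs = list(zip(sequence, sequence[1:]))
    return -sum(math.log(cond_prob(bigrams, contexts, w1, w2))
                for w1, w2 in pairs) / len(pairs)

sentences = preprocess()
bigrams, contexts = bigram_counts(sentences)
seq = generate(bigrams, contexts)
print("Generated:", " ".join(seq))
print("Perplexity:", math.exp(log_perplexity(seq, bigrams, contexts)))
```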
Q2. POS Tagger - HMM: Code
You will learn to build a Hidden Markov Model using the Viterbi algorithm and apply it to the task of POS tagging. Complete each of the following tasks.
- Load the NLTK Treebank tagged sentences using nltk.corpus.treebank.tagged_sents(). Use the first 80% of the sentences for training and the remaining 20% for testing.
- Extract the word and the tag from each of the sentences and create a vocabulary of all the words and a set of all tags.
- To implement the Viterbi algorithm, you need two components:
• Tag transition probability matrix A: it represents the probability of a tag occurring given the previous tag, or $p(t_i \mid t_{i-1})$. We compute the maximum likelihood estimate (MLE) of this probability by counting the occurrences of the tag $t_{i-1}$ followed by the tag $t_i$:
$$p(t_i \mid t_{i-1}) = \frac{count(t_{i-1}, t_i)}{count(t_{i-1})}$$
• Emission probability matrix B: it represents the probability of a tag $t_i$ being associated with a given word $w_i$, or $p(w_i \mid t_i)$. The MLE estimate is:
$$p(w_i \mid t_i) = \frac{count(t_i, w_i)}{count(t_i)}$$
Since the number of tags is small, creating matrix A is time-efficient, whereas generating matrix B will be much more expensive due to the vocabulary size.
- Implement a method compute_tag_trans_probs() to calculate matrix A by parsing the sentences in the training set and counting the occurrences of the tag $t_{i-1}$ followed by $t_i$ (see the sketch after this list).
- Implement a method emission_probs() to calculate the emission probability of a given word $w_i$ having a tag $t_i$.
- The next step in HMM is decoding, which entails determining the hidden-variable sequence underlying the sequence of observations. In POS tagging, decoding means choosing the sequence of tags that is most probable for the sequence of words. We compute this using the following equation:
$$\hat{t}_{1:n} = \operatorname*{argmax}_{t_1 \ldots t_n} \prod_{i=1}^{n} p(w_i \mid t_i)\, p(t_i \mid t_{i-1})$$
The optimal solution for HMM decoding is given by the Viterbi algorithm, a dynamic-programming approach to computing the decoded tags. Implement the algorithm using the two methods implemented above, compute_tag_trans_probs() and emission_probs(), and return the sequence of tags corresponding to the given sequence of words. Refer to Section 8.4.5, Fig. 8.10 of the Speech and Language Processing book (link 5 below). A decoding sketch also follows this list.
- Evaluate the performance of the model in terms of accuracy on the test set.
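Q2 sketch, part 1: loading the Treebank split and computing the two probability tables. The method names compute_tag_trans_probs() and emission_probs() come from the assignment; representing the "matrices" as sparse dictionaries keyed by pairs, and using a "<s>" pseudo-tag to mark sentence starts, are choices of this sketch rather than assignment requirements.

```python
from collections import defaultdict

import nltk
nltk.download("treebank")
from nltk.corpus import treebank

tagged = treebank.tagged_sents()
split = int(0.8 * len(tagged))
train_sents, test_sents = tagged[:split], tagged[split:]

def compute_tag_trans_probs(sentences):
    """Matrix A: p(t_i | t_{i-1}) = count(t_{i-1}, t_i) / count(t_{i-1}),
    stored sparsely as a dict keyed by (t_prev, t)."""
    trans, prev_counts = defaultdict(int), defaultdict(int)
    for sent in sentences:
        tags = ["<s>"] + [tag for _, tag in sent]   # "<s>" marks sentence start
        for t_prev, t in zip(tags, tags[1:]):
            trans[(t_prev, t)] += 1
            prev_counts[t_prev] += 1
    return {pair: c / prev_counts[pair[0]] for pair, c in trans.items()}

def emission_probs(sentences):
    """Matrix B: p(w_i | t_i) = count(t_i, w_i) / count(t_i),
    stored sparsely as a dict keyed by (tag, word)."""
    emit, tag_counts = defaultdict(int), defaultdict(int)
    for sent in sentences:
        for word, tag in sent:
            emit[(tag, word)] += 1
            tag_counts[tag] += 1
    return {pair: c / tag_counts[pair[0]] for pair, c in emit.items()}
```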
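Q2 sketch, part 2: Viterbi decoding and evaluation, continuing from the split and methods above. The computation is done in log space to avoid underflow, and unseen transitions/emissions get a tiny floor probability (1e-12) as a crude smoothing assumption of this sketch; Fig. 8.10 of the textbook does not prescribe that floor.

```python
import math

def viterbi(words, A, B, tags):
    """Most probable tag sequence for `words` via dynamic programming
    (cf. SLP Fig. 8.10). V[i][t] = (best log-probability, backpointer)."""
    def log_p(table, key):
        return math.log(table.get(key, 1e-12))   # floor for unseen events
    V = [{t: (log_p(A, ("<s>", t)) + log_p(B, (t, words[0])), None) for t in tags}]
    for i, word in enumerate(words[1:], start=1):
        row = {}
        for t in tags:
            best_prev = max(tags, key=lambda tp: V[i - 1][tp][0] + log_p(A, (tp, t)))
            score = V[i - 1][best_prev][0] + log_p(A, (best_prev, t)) + log_p(B, (t, word))
            row[t] = (score, best_prev)
        V.append(row)
    # Backtrace from the best final state.
    last = max(V[-1], key=lambda t: V[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(V[i][path[-1]][1])
    return list(reversed(path))

A = compute_tag_trans_probs(train_sents)
B = emission_probs(train_sents)
tag_set = {tag for sent in train_sents for _, tag in sent}

# Token-level accuracy on the held-out 20%.
correct = total = 0
for sent in test_sents:
    words, gold = zip(*sent)
    pred = viterbi(list(words), A, B, tag_set)
    correct += sum(p == g for p, g in zip(pred, gold))
    total += len(gold)
print(f"HMM tagging accuracy: {correct / total:.3f}")
```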
Q3. POS Tagger - CRF: Code
You will learn to implement a POS tagger using CRF. Complete each of the following tasks.
- Load the NLTK Treebank tagged sentences using nltk.corpus.treebank.tagged_sents(). Use the first 80% of the sentences for training and the remaining 20% for testing.
- Extract the word and the tag from each of the sentences and create a vocabulary of all the words and a set of all tags.
- Build the following feature set for each token/word:
• The current token/word
• Is the word a number? (boolean value)
• Does the word contain any hyphens? (boolean value)
• Is the word all uppercase? (boolean value)
• Does the word have any uppercase letters? (boolean value)
• Is the word all lowercase? (boolean value)
• Length of the word
• Bigrams of the word
- Use the CRF model from the sklearn_crfsuite library (link 6 below) and train it with the feature set built above (see the sketch after this list).
- Evaluate the performance of the model in terms of accuracy on the test set.
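Q3 sketch: feature extraction and CRF training with sklearn_crfsuite, reusing the train_sents/test_sents split from the Q2 sketch. "Bigrams of the word" is read here as character bigrams, which is an interpretation the assignment does not spell out; the hyperparameters (lbfgs, c1=c2=0.1, 100 iterations) are common values from the sklearn_crfsuite documentation, not assignment requirements.

```python
import sklearn_crfsuite
from sklearn_crfsuite import metrics

def word_features(word):
    """Feature dict for one token, mirroring the assignment's feature list."""
    feats = {
        "word": word,                                  # the current token
        "is_number": word.isdigit(),                   # is the word a number?
        "has_hyphen": "-" in word,                     # any hyphens?
        "is_all_upper": word.isupper(),                # all uppercase?
        "has_upper": any(c.isupper() for c in word),   # any uppercase letters?
        "is_all_lower": word.islower(),                # all lowercase?
        "length": len(word),                           # word length
    }
    # Character bigrams of the word, one binary feature each.
    for j in range(len(word) - 1):
        feats["bigram=" + word[j:j + 2]] = True
    return feats

X_train = [[word_features(w) for w, _ in sent] for sent in train_sents]
y_train = [[t for _, t in sent] for sent in train_sents]
X_test = [[word_features(w) for w, _ in sent] for sent in test_sents]
y_test = [[t for _, t in sent] for sent in test_sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
y_pred = crf.predict(X_test)
print(f"CRF tagging accuracy: {metrics.flat_accuracy_score(y_test, y_pred):.3f}")
```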
LINKS
1. Link to the book: https://www.gutenberg.org/ebooks/64317
2. https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.sent_tokenize
3. https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.word_tokenize
4. https://docs.python.org/3/library/random.html#random.choice
5. https://web.stanford.edu/~jurafsky/slp3/8.pdf
6. https://sklearn-crfsuite.readthedocs.io/en/latest/api.html