Question:
(REALLY NEED HELP CREATING THIS CODE IN FULL AND ITS COMPLETE ENTIRETY... ALL OF THE DETAILS ARE PROVIDED AND THE CODE SHOULD HAVE EACH PART FOR EACH QUESTION LABELED SEPARATELY... THE REQUIRED LINKS TO CREATE THE PATHWAYS ARE ALL LISTED AT THE END OF THE INSTRUCTIONS AND QUESTIONS... PROVIDE OUTPUT AS WELL... PLEASE HELP ME AND I PROMISE TO LEAVE GOOD FEEDBACK AND REVIEWS)
1 Introduction
In this assignment, you will learn to build a simple text generation model using N-grams. You will also implement a part-of-speech (POS) tagger using Hidden Markov Models and Conditional Random Fields. POS tagging marks each word with a grammatical tag; these tags are essential for understanding sentence structure, tell us about the likelihood of neighboring words, and make POS tagging an integral part of parsing.
- N-grams make use of the chain rule of probability, together with an approximation, to find the probability of a word occurring given the previous N−1 words. This lets us assign a probability to the next word and generate text sequentially.
- A Hidden Markov Model (HMM) is a generative sequence model that computes the probability of a sequence of hidden labels given a sequence of observable inputs. One of the many applications of HMMs is POS tagging: for a given sentence, the HMM learns the probability of a hidden grammatical tag sequence given the observed sequence of words. In this assignment, you will learn to implement the HMM using the Viterbi algorithm.
- A Conditional Random Field (CRF) is a discriminative sequence model for prediction tasks where contextual information, or the state of the neighbors, affects the current prediction. HMMs use the Markov assumption, imposing a dependency only on the previous element; in contrast, a CRF allows the modeling of arbitrary feature functions.
2 Instructions
Code
This section needs to be completed using Python 3.6+. You will also require the following packages:
- pandas
- numpy
- NLTK or SpaCy
- scikit-learn
- random
If you want to use an external package for any reason, you are required to get approval from the course staff prior to submission.
3 Questions
Q1. Text Generation using N-grams: Code
In this question, you will build a language model using bigrams, which approximates the probability of a word $w_n$ given all the previous words, $P(w_n \mid w_{1:n-1})$, by using only the conditional probability of the preceding word, $P(w_n \mid w_{n-1})$. Then for a given text, using the chain rule,
$$p(d) = p(w_{1:N}) = \prod_{k=1}^{N} p(w_k \mid w_{k-1})$$
You will then evaluate the model using the metric 'Perplexity'. Complete the following tasks:
- Download the .txt file of the book "The Great Gatsby" from the Gutenberg Project (link 1 below). Write a program to read the book and preprocess it by first applying the sentence tokenizer (link 2) and then the word tokenizer (link 3). Make sure to remove all punctuation and stopwords. A sketch of this preprocessing pipeline appears after this list.
- Add a <s> token at the start and a </s> token at the end of each sentence, and then generate a dictionary containing the bigrams and their frequencies. The sentence 'I love nlp' will become '<s> I love nlp </s>', so the bigrams for this sentence will be '<s> I', 'I love', 'love nlp', 'nlp </s>'.
- With the following formula, calculate the conditional probability of each word given the previous word:
$$p(w_i \mid w_{i-1}) = \frac{count(w_{i-1}, w_i)}{count(w_{i-1})}$$
- Using 'He' as the first token, generate the next 5 words as follows (see the generation sketch after this list):
• For word $w_{i-1}$, get the probability of every other word $w_i$ given the word $w_{i-1}$, and make a list of the 10 words with the highest probability.
• Use the method random.choice (link 4 below) on the generated list to get a random word with high probability.
• Continue the process until you have generated the next 5 words or you encounter a </s> token.
- With the perplexity metric as defined below, evaluate the performance of the model on the generated sequence obtained from the previous step.
$$PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{p(w_i \mid w_{i-1})}}$$
To avoid underflow, use log space to calculate the perplexity metric.
$$\log PP(W) = -\frac{1}{N} \sum_{i=1}^{N} \log p(w_i \mid w_{i-1})$$
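Q1 sketch, part 1: a minimal illustration of the preprocessing, bigram-counting, and conditional-probability steps, assuming the book has been saved locally as gatsby.txt (a hypothetical filename) and that NLTK's punkt and stopwords data can be downloaded. All function names are illustrative, not prescribed by the assignment. Note one judgment call: NLTK's stopword list contains 'he', so the filter below is kept case-sensitive; a lowercased filter would delete the capitalized 'He' that the generation step relies on.

```python
import string
from collections import defaultdict

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt")
nltk.download("stopwords")

STOP_WORDS = set(stopwords.words("english"))

def preprocess(path="gatsby.txt"):
    """Read the book, sentence-tokenize, word-tokenize, and drop punctuation/stopwords."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    sentences = []
    for sent in sent_tokenize(text):
        tokens = [w for w in word_tokenize(sent)
                  if w not in string.punctuation  # simple filter; misses multi-char tokens like "--"
                  and w not in STOP_WORDS]        # case-sensitive, so 'He' survives
        if tokens:
            # Mark sentence boundaries so the model can learn starts and ends.
            sentences.append(["<s>"] + tokens + ["</s>"])
    return sentences

def bigram_counts(sentences):
    """Build the bigram frequency dictionary plus context (previous-word) counts."""
    bigrams, contexts = defaultdict(int), defaultdict(int)
    for sent in sentences:
        for w1, w2 in zip(sent, sent[1:]):
            bigrams[(w1, w2)] += 1
            contexts[w1] += 1          # count(w_{i-1}), the MLE denominator
    return bigrams, contexts

def cond_prob(bigrams, contexts, w_prev, w):
    """MLE estimate p(w | w_prev) = count(w_prev, w) / count(w_prev)."""
    denom = contexts.get(w_prev, 0)
    return bigrams[(w_prev, w)] / denom if denom else 0.0
```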
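Q1 sketch, part 2: generation and perplexity, continuing from the functions above. The top-10 list plus random.choice follows the assignment's recipe; the perplexity is computed in log space as instructed, and math.exp recovers PP(W) from it.

```python
import math
import random

def generate(bigrams, contexts, start="He", max_words=5, top_k=10):
    """Generate up to max_words tokens after `start`, sampling from the
    top_k most probable successors at each step."""
    sequence = [start]
    for _ in range(max_words):
        prev = sequence[-1]
        # Rank every observed successor of `prev` by conditional probability.
        candidates = sorted(
            ((w2, cond_prob(bigrams, contexts, prev, w2))
             for (w1, w2) in bigrams if w1 == prev),
            key=lambda pair: pair[1], reverse=True)[:top_k]
        if not candidates:
            break
        word = random.choice([w for w, _ in candidates])
        if word == "</s>":              # stop at an end-of-sentence token
            break
        sequence.append(word)
    return sequence

def log_perplexity(sequence, bigrams, contexts):
    """log PP(W) = -(1/N) * sum_i log p(w_i | w_{i-1}), computed in log space."""
    pairs = list(zip(sequence, sequence[1:]))
    return -sum(math.log(cond_prob(bigrams, contexts, w1, w2))
                for w1, w2 in pairs) / len(pairs)

sentences = preprocess()
bigrams, contexts = bigram_counts(sentences)
seq = generate(bigrams, contexts)
print("Generated:", " ".join(seq))
print("Perplexity:", math.exp(log_perplexity(seq, bigrams, contexts)))
```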
Q2. POS Tagger - HMM: Code
You will learn to build a Hidden Markov Model using the Viterbi algorithm and apply it to the task of POS tagging. Complete each of the following tasks.
- Load the NLTK Treebank tagged sentences using nltk.corpus.treebank.tagged_sents(). Use the first 80% of the sentences for training and the remaining 20% for testing.
- Extract the word and the tag from each of the sentences and create a vocabulary of all the words and a set of all tags.
- To implement the Viterbi algorithm, you need two components:
• Tag transition probability matrix A: it represents the probability of a tag occurring given the previous tag, or $p(t_i \mid t_{i-1})$. We compute the maximum likelihood estimate (MLE) of this probability by counting the occurrences of the tag $t_{i-1}$ followed by the tag $t_i$:
$$p(t_i \mid t_{i-1}) = \frac{count(t_{i-1}, t_i)}{count(t_{i-1})}$$
• Emission probability matrix B: it represents the probability of a tag $t_i$ being associated with a given word $w_i$, or $p(w_i \mid t_i)$. The MLE estimate is:
$$p(w_i \mid t_i) = \frac{count(t_i, w_i)}{count(t_i)}$$
Since the number of tags is small, creating matrix A is time-efficient, whereas generating matrix B will be much more expensive due to the vocabulary size.
- Implement a method compute_tag_trans_probs() to calculate matrix A by parsing the sentences in the training set and counting the occurrences of the tag $t_{i-1}$ followed by $t_i$ (see the sketch after this list).
- Implement a method emission_probs() to calculate the emission probability of a given word $w_i$ having a tag $t_i$.
- The next step in HMM is decoding, which entails determining the hidden-variable sequence underlying the sequence of observations. In POS tagging, decoding means choosing the sequence of tags that is most probable for the sequence of words. We compute this using the following equation:
$$\hat{t}_{1:n} = \operatorname*{argmax}_{t_1 \ldots t_n} \prod_{i=1}^{n} p(w_i \mid t_i)\, p(t_i \mid t_{i-1})$$
The optimal solution for HMM decoding is given by the Viterbi algorithm, a dynamic-programming approach to computing the decoded tags. Implement the algorithm using the two methods implemented above, compute_tag_trans_probs() and emission_probs(), and return the sequence of tags corresponding to the given sequence of words. Refer to Section 8.4.5, Fig. 8.10 of the Speech and Language Processing book (link 5 below). A decoding sketch also follows this list.
- Evaluate the performance of the model in terms of accuracy on the test set.
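Q2 sketch, part 1: loading the Treebank split and computing the two probability tables. The method names compute_tag_trans_probs() and emission_probs() come from the assignment; representing the "matrices" as sparse dictionaries keyed by pairs, and using a "<s>" pseudo-tag to mark sentence starts, are choices of this sketch rather than assignment requirements.

```python
from collections import defaultdict

import nltk
nltk.download("treebank")
from nltk.corpus import treebank

tagged = treebank.tagged_sents()
split = int(0.8 * len(tagged))
train_sents, test_sents = tagged[:split], tagged[split:]

def compute_tag_trans_probs(sentences):
    """Matrix A: p(t_i | t_{i-1}) = count(t_{i-1}, t_i) / count(t_{i-1}),
    stored sparsely as a dict keyed by (t_prev, t)."""
    trans, prev_counts = defaultdict(int), defaultdict(int)
    for sent in sentences:
        tags = ["<s>"] + [tag for _, tag in sent]   # "<s>" marks sentence start
        for t_prev, t in zip(tags, tags[1:]):
            trans[(t_prev, t)] += 1
            prev_counts[t_prev] += 1
    return {pair: c / prev_counts[pair[0]] for pair, c in trans.items()}

def emission_probs(sentences):
    """Matrix B: p(w_i | t_i) = count(t_i, w_i) / count(t_i),
    stored sparsely as a dict keyed by (tag, word)."""
    emit, tag_counts = defaultdict(int), defaultdict(int)
    for sent in sentences:
        for word, tag in sent:
            emit[(tag, word)] += 1
            tag_counts[tag] += 1
    return {pair: c / tag_counts[pair[0]] for pair, c in emit.items()}
```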
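Q2 sketch, part 2: Viterbi decoding and evaluation, continuing from the split and methods above. The computation is done in log space to avoid underflow, and unseen transitions/emissions get a tiny floor probability (1e-12) as a crude smoothing assumption of this sketch; Fig. 8.10 of the textbook does not prescribe that floor.

```python
import math

def viterbi(words, A, B, tags):
    """Most probable tag sequence for `words` via dynamic programming
    (cf. SLP Fig. 8.10). V[i][t] = (best log-probability, backpointer)."""
    def log_p(table, key):
        return math.log(table.get(key, 1e-12))   # floor for unseen events
    V = [{t: (log_p(A, ("<s>", t)) + log_p(B, (t, words[0])), None) for t in tags}]
    for i, word in enumerate(words[1:], start=1):
        row = {}
        for t in tags:
            best_prev = max(tags, key=lambda tp: V[i - 1][tp][0] + log_p(A, (tp, t)))
            score = V[i - 1][best_prev][0] + log_p(A, (best_prev, t)) + log_p(B, (t, word))
            row[t] = (score, best_prev)
        V.append(row)
    # Backtrace from the best final state.
    last = max(V[-1], key=lambda t: V[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(V[i][path[-1]][1])
    return list(reversed(path))

A = compute_tag_trans_probs(train_sents)
B = emission_probs(train_sents)
tag_set = {tag for sent in train_sents for _, tag in sent}

# Token-level accuracy on the held-out 20%.
correct = total = 0
for sent in test_sents:
    words, gold = zip(*sent)
    pred = viterbi(list(words), A, B, tag_set)
    correct += sum(p == g for p, g in zip(pred, gold))
    total += len(gold)
print(f"HMM tagging accuracy: {correct / total:.3f}")
```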
Q3. POS Tagger - CRF: Code
You will learn to implement a POS tagger using CRF. Complete each of the following tasks.
- Load the NLTK Treebank tagged sentences using nltk.corpus.treebank.tagged_sents(). Use the first 80% of the sentences for training and the remaining 20% for testing.
- Extract the word and the tag from each of the sentences and create a vocabulary of all the words and a set of all tags.
- Build the following feature set for each token/word:
• The current token/word
• Is the word a number? (boolean value)
• Does the word contain any hyphens? (boolean value)
• Is the word all uppercase? (boolean value)
• Does the word have any uppercase letters? (boolean value)
• Is the word all lowercase? (boolean value)
• Length of the word
• Bigrams of the word
- Use the CRF model from the sklearn_crfsuite library (link 6 below) and train it with the feature set built above (see the sketch after this list).
- Evaluate the performance of the model in terms of accuracy on the test set.
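Q3 sketch: feature extraction and CRF training with sklearn_crfsuite, reusing the train_sents/test_sents split from the Q2 sketch. "Bigrams of the word" is read here as character bigrams, which is an interpretation the assignment does not spell out; the hyperparameters (lbfgs, c1=c2=0.1, 100 iterations) are common values from the sklearn_crfsuite documentation, not assignment requirements.

```python
import sklearn_crfsuite
from sklearn_crfsuite import metrics

def word_features(word):
    """Feature dict for one token, mirroring the assignment's feature list."""
    feats = {
        "word": word,                                  # the current token
        "is_number": word.isdigit(),                   # is the word a number?
        "has_hyphen": "-" in word,                     # any hyphens?
        "is_all_upper": word.isupper(),                # all uppercase?
        "has_upper": any(c.isupper() for c in word),   # any uppercase letters?
        "is_all_lower": word.islower(),                # all lowercase?
        "length": len(word),                           # word length
    }
    # Character bigrams of the word, one binary feature each.
    for j in range(len(word) - 1):
        feats["bigram=" + word[j:j + 2]] = True
    return feats

X_train = [[word_features(w) for w, _ in sent] for sent in train_sents]
y_train = [[t for _, t in sent] for sent in train_sents]
X_test = [[word_features(w) for w, _ in sent] for sent in test_sents]
y_test = [[t for _, t in sent] for sent in test_sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
y_pred = crf.predict(X_test)
print(f"CRF tagging accuracy: {metrics.flat_accuracy_score(y_test, y_pred):.3f}")
```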
LINKS
1. Link to the book: https://www.gutenberg.org/ebooks/64317
2. https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.sent_tokenize
3. https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.word_tokenize
4. https://docs.python.org/3/library/random.html#random.choice
5. https://web.stanford.edu/~jurafsky/slp3/8.pdf
6. https://sklearn-crfsuite.readthedocs.io/en/latest/api.html