Question:

1. Extract features using a TF-IDF vectorizer with a feature size of 10000 and an ngram_range of 1-3; (20 points)
2. Build your 1st classifier using Naive Bayes; (20 points)
3. Build your 2nd classifier using Logistic Regression; (20 points)
4. Build your model evaluator using F1-score; (10 points)
5. Evaluate your models on the development set, and choose the best model; (10 points)
6. Apply your best model on the test set and conduct evaluation. (20 points)

Note:
- For this homework, we will use a Python toolkit/package called "scikit-learn".
- For the preprocessing, we start with tokenization using the NLTK tokenizer. You may try the other tokenization methods in the last bonus problem.
- For the feature extraction, we will use the feature extractors from scikit-learn.
- For the text classifiers, we start with Naive Bayes and Logistic Regression classifiers, but you can explore other classifiers at this page, which has a list of supervised machine learning models.
- We will use the evaluation package to evaluate your model performance.

Feel free to read through the linked documentation, which will be helpful for you to finish the homework challenges.

```python
# you should define all your import packages here
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import f1_score, accuracy_score

# Load and preprocess datasets
def load_and_preprocess(data_path, test=False):
    """Load and preprocess your datasets

    Parameters:
        data_path (str): your data path
        test (bool): if the file is a test file
    """
    x = []
    y = []
    with open(data_path) as dfile:
        # the first line is the header row; find the review and rating columns
        cols = dfile.readline().strip().split('\t')
        review_idx = cols.index('review')
        rating_idx = cols.index('rating')
        for line in dfile:
            if len(line) < 5:
                continue
            line = line.strip().split('\t')
            x.append(line[review_idx])
            y.append(int(line[rating_idx]))
    return x, y
```

```python
# Load your training, development, and test datasets
train_x, train_y = load_and_preprocess('./data/train.tsv')           # training set
dev_x, dev_y = load_and_preprocess('./data/dev.tsv')                 # development set
test_x, test_y = load_and_preprocess('./data/test.tsv', test=True)   # test set
```

Task 3: Build your 2nd classifier using Logistic Regression

```python
def build_LR_classifier(x, y):
    """
    Parameters:
        x (scipy.sparse.csr.csr_matrix): document features
        y (list): a list of document labels
    """
    # Implement your function here
    pass

lr_clf = build_LR_classifier(train_x_feats, train_y)
```

Task 4: Build your model evaluator using F1-score

```python
def f1_evaluator(x, y_truth, clf):
    """
    Parameters:
        x (scipy.sparse.csr.csr_matrix): document features
        y_truth (list): a list of original document labels
        clf (object): a well-trained text classifier
    """
    result = 0.0
    # Implement your function here
    print(result)
    return result
```

```python
## Example code for evaluating models by accuracy
# dev_y_pred = nb_clf.predict(dev_x_feats)
# print("Accuracy:", sum(dev_y_pred == dev_y) / len(dev_y))
```

    Accuracy: 0.541
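The cells above leave the vectorizer, the two classifiers, and the evaluator unimplemented. Below is a minimal sketch of one way Tasks 1-4 could be filled in; the choice of `MultinomialNB`, `LogisticRegression(max_iter=1000)`, and `'macro'` F1 averaging are assumptions rather than requirements of the assignment, and the variable names mirror those already used in the notebook (`train_x_feats`, `dev_x_feats`, `nb_clf`, `lr_clf`).

```python
# Minimal sketch (not the official solution) for Tasks 1-4.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB          # assumed classifier choice for "Naive Bayes"
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Task 1: TF-IDF features with a 10,000-term vocabulary and unigram-to-trigram n-grams
vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1, 3))
train_x_feats = vectorizer.fit_transform(train_x)   # fit the vocabulary on training data only
dev_x_feats = vectorizer.transform(dev_x)           # reuse the fitted vocabulary for dev data

# Task 2: Naive Bayes classifier
def build_NB_classifier(x, y):
    clf = MultinomialNB()
    clf.fit(x, y)
    return clf

# Task 3: Logistic Regression classifier
def build_LR_classifier(x, y):
    clf = LogisticRegression(max_iter=1000)   # larger max_iter helps convergence on sparse TF-IDF
    clf.fit(x, y)
    return clf

nb_clf = build_NB_classifier(train_x_feats, train_y)
lr_clf = build_LR_classifier(train_x_feats, train_y)

# Task 4: F1-score evaluator
def f1_evaluator(x, y_truth, clf):
    y_pred = clf.predict(x)
    # 'macro' averaging is one reasonable choice for multi-class ratings;
    # the assignment may expect a different averaging mode.
    result = f1_score(y_truth, y_pred, average='macro')
    print(result)
    return result
```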
Task 5: Evaluate your models on the development set, and choose the best model

```python
f1_nb_score = f1_evaluator(dev_x_feats, dev_y, nb_clf)
f1_lr_score = f1_evaluator(dev_x_feats, dev_y, lr_clf)

# Implement your code here to find out the best classifier:
# if the NB classifier performs better than the LR classifier, assign the NB classifier as the best clf;
# else, assign the LR classifier as the best clf.
# You should compare the F1-scores of your models and name the best model "best_clf"
best_clf = ?
```

Task 6: Apply your best model on the test set

```python
# Implement your code here.
# First, you have to load your test data
test_x, test_y = ?
# Next, you will encode the documents into features
test_x_feats = ?
# Evaluate your test result using the features, document labels, and your best clf
your_score = ?
```
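Continuing the same sketch, Tasks 5 and 6 reduce to comparing the two development-set F1 scores and then reusing the already-fitted vectorizer on the test split. The code below assumes the names defined in the earlier sketch (`vectorizer`, `nb_clf`, `lr_clf`, `f1_evaluator`) and the `load_and_preprocess` helper from the question.

```python
# Minimal sketch (not the official solution) for Tasks 5 and 6.

# Task 5: pick whichever classifier scores higher on the development set
f1_nb_score = f1_evaluator(dev_x_feats, dev_y, nb_clf)
f1_lr_score = f1_evaluator(dev_x_feats, dev_y, lr_clf)
best_clf = nb_clf if f1_nb_score > f1_lr_score else lr_clf

# Task 6: load the test split, encode it with the already-fitted vectorizer,
# and evaluate the best classifier on it
test_x, test_y = load_and_preprocess('./data/test.tsv', test=True)
test_x_feats = vectorizer.transform(test_x)   # transform only; never refit on test data
your_score = f1_evaluator(test_x_feats, test_y, best_clf)
```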
Step by Step Solution
It seems like you're working on building and evaluating text classifiers using Naive Bayes and Logistic Regression models. Let's go through each task step by step…
