Question:

1. Extract features using a TF-IDF vectorizer with a feature size of 10000 and an ngram range of 1-3; (20 points)
2. Build your 1st classifier using Naive Bayes; (20 points)
3. Build your 2nd classifier using Logistic Regression; (20 points)
4. Build your model evaluator using F1-score; (10 points)
5. Evaluate your models on the development set, and choose the best model; (10 points)
6. Apply your best model on the test set and conduct evaluation. (20 points)

Note:
- For this homework, we will use a Python toolkit/package called "scikit-learn".
- For the preprocessing, we start with tokenization using the NLTK tokenizer. You may try the other tokenization methods in the last bonus problem.
- For the feature extraction, we will use the feature extractors from scikit-learn.
- For the text classifiers, we start with Naive Bayes and Logistic Regression classifiers, but you can explore other classifiers at this page, which has a list of supervised machine learning models.
- We will use the evaluation package to evaluate your model performance.

Feel free to read through the linked documentation, which will be helpful for you to finish the homework challenges.

[ ]: # you should define all your import packages here
     from sklearn.linear_model import LogisticRegression, Perceptron
     from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
     from sklearn.metrics import f1_score, accuracy_score

[ ]: # Load and preprocess datasets
     def load_and_preprocess(data_path, test=False):
         """Load and preprocess your datasets

         Parameters:
             data_path (str): your data path
             test (bool): if the file is a test file
         """
         x = []
         y = []
         with open(data_path) as dfile:
             cols = dfile.readline().strip().split('\t')
             review_idx = cols.index('review')
             rating_idx = cols.index('rating')
             for line in dfile:
                 if len(line) < 5:
                     continue
                 line = line.strip().split('\t')
                 x.append(line[review_idx])
                 y.append(int(line[rating_idx]))
         return x, y

[ ]: # Load your training, development, and test datasets
     train_x, train_y = load_and_preprocess('./data/train.tsv')           # training set
     dev_x, dev_y = load_and_preprocess('./data/dev.tsv')                 # development set
     test_x, test_y = load_and_preprocess('./data/test.tsv', test=True)   # test set

Task 3: Build your 2nd classifier using Logistic Regression

[ ]: def build_LR_classifier(x, y):
         """
         Parameters:
             x (scipy.sparse.csr.csr_matrix): document features
             y (list): a list of document labels
         """
         # Implement your function here
         pass

     lr_clf = build_LR_classifier(train_x_feats, train_y)

Task 4: Build your model evaluator using F1-score

[ ]: def f1_evaluator(x, y_truth, clf):
         """
         Parameters:
             x (scipy.sparse.csr.csr_matrix): document features
             y_truth (list): a list of original document labels
             clf (object): a well-trained text classifier
         """
         result = 0.0
         # Implement your function here
         print(result)
         return result

[ ]: ## Example code for evaluating models by accuracy
     # dev_y_pred = pn_clf.predict(dev_x_feats)
     # print("Accuracy:", sum(dev_y_pred == dev_y) / len(dev_y))

     Accuracy: 0.541

Task 5: Evaluate your models on the development set, and choose the best model

[ ]: f1_nb_score = f1_evaluator(dev_x_feats, dev_y, nb_clf)
     f1_clf_score = f1_evaluator(dev_x_feats, dev_y, lr_clf)

[ ]: # Implement your code here to find out the best classifier
     # if the NB classifier performs better than the LR classifier, assign the NB classifier as the best clf
     # else, assign the LR classifier as the best clf
     # You should compare the f1-scores of your models and name the best model "best_clf"
     best_clf = ?
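The notebook cells for Tasks 1 and 2 (the TF-IDF extraction that produces train_x_feats/dev_x_feats and the Naive Bayes builder that produces nb_clf) are not shown in the excerpt above. The following is a minimal sketch of one way Tasks 1-5 could be completed; it assumes MultinomialNB for the Naive Bayes model, macro-averaged F1 for the evaluator, and an illustrative variable name "vectorizer", none of which are fixed by the assignment text.

     # A possible sketch for Tasks 1-5 (not part of the original notebook)
     from sklearn.feature_extraction.text import TfidfVectorizer
     from sklearn.naive_bayes import MultinomialNB
     from sklearn.linear_model import LogisticRegression
     from sklearn.metrics import f1_score

     # Task 1: TF-IDF features with a vocabulary capped at 10000 and 1-3 grams
     vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1, 3))
     train_x_feats = vectorizer.fit_transform(train_x)   # fit the vocabulary on training data only
     dev_x_feats = vectorizer.transform(dev_x)           # reuse the fitted vocabulary

     # Task 2: Naive Bayes classifier (MultinomialNB is an assumption; the
     # assignment only says "Naive Bayes")
     def build_NB_classifier(x, y):
         clf = MultinomialNB()
         clf.fit(x, y)
         return clf

     # Task 3: Logistic Regression classifier
     def build_LR_classifier(x, y):
         clf = LogisticRegression(max_iter=1000)  # larger max_iter helps convergence on sparse text features
         clf.fit(x, y)
         return clf

     # Task 4: F1 evaluator; macro averaging is an assumption for multi-class ratings
     def f1_evaluator(x, y_truth, clf):
         y_pred = clf.predict(x)
         result = f1_score(y_truth, y_pred, average='macro')
         print(result)
         return result

     # Task 5: keep whichever classifier scores the higher F1 on the development set
     nb_clf = build_NB_classifier(train_x_feats, train_y)
     lr_clf = build_LR_classifier(train_x_feats, train_y)
     f1_nb_score = f1_evaluator(dev_x_feats, dev_y, nb_clf)
     f1_clf_score = f1_evaluator(dev_x_feats, dev_y, lr_clf)
     best_clf = nb_clf if f1_nb_score > f1_clf_score else lr_clf

Fitting the vectorizer only on the training split and calling transform on the development split keeps the development data out of vocabulary selection, so the development F1 scores remain an honest basis for choosing best_clf.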
Task 6: Apply your best model on the test set

[ ]: # Implement your code here
     # first, you have to load your test data
     test_x, test_y = ?
     # next, you will encode the documents into features
     test_x_feats = ?
     # evaluate your test result using the features, document labels, and your best clf
     your_score = ?
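As a companion to the Task 6 skeleton, here is a minimal sketch under the same assumptions as the previous sketch (the fitted "vectorizer", the f1_evaluator, and the best_clf chosen on the development set); the test path mirrors the one used when the test set was first loaded.

     # A possible sketch for Task 6 (not part of the original notebook)
     # first, load the test data with the same loader used for train/dev
     test_x, test_y = load_and_preprocess('./data/test.tsv', test=True)

     # next, encode the test documents with the vectorizer fitted on the training set
     test_x_feats = vectorizer.transform(test_x)

     # finally, evaluate the best classifier on the test features and labels
     your_score = f1_evaluator(test_x_feats, test_y, best_clf)

Only transform (not fit_transform) is used on the test documents, so they are represented in the same feature space the classifiers were trained on.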
