Question:

1. Extract features using a TF-IDF vectorizer with a feature size of 10000 and an ngram_range of 1-3; (20 points)
2. Build your 1st classifier using Naive Bayes; (20 points)
3. Build your 2nd classifier using Logistic Regression; (20 points)
4. Build your model evaluator using F1-score; (10 points)
5. Evaluate your models on the development set, and choose the best model; (10 points)
6. Apply your best model on the test set and conduct evaluation. (20 points)

Note:
- For this homework, we will use a Python toolkit/package called "scikit-learn".
- For the preprocessing, we start with tokenization using the NLTK tokenizer. You may try the other tokenization methods in the last bonus problem.
- For the feature extraction, we will use the feature extractors from scikit-learn.
- For the text classifiers, we start with Naive Bayes and Logistic Regression classifiers, but you can explore other classifiers at this page, which has a list of supervised machine learning models.
- We will use the evaluation package to evaluate your model performance.

Feel free to read through the linked documentation, which will be helpful for you to finish the homework challenges.

```python
# you should define all your import packages here
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import f1_score, accuracy_score

# Load and preprocess datasets
def load_and_preprocess(data_path, test=False):
    """Load and preprocess your datasets

    Parameters:
        data_path (str): your data path
        test (bool): if the file is a test file
    """
    x = []
    y = []
    with open(data_path) as dfile:
        # the first line is the header row; find the review and rating columns
        cols = dfile.readline().strip().split('\t')
        review_idx = cols.index('review')
        rating_idx = cols.index('rating')
        for line in dfile:
            if len(line) < 5:
                continue
            line = line.strip().split('\t')
            x.append(line[review_idx])
            y.append(int(line[rating_idx]))
    return x, y
```

```python
# Load your training, development, and test datasets
train_x, train_y = load_and_preprocess('./data/train.tsv')           # training set
dev_x, dev_y = load_and_preprocess('./data/dev.tsv')                 # development set
test_x, test_y = load_and_preprocess('./data/test.tsv', test=True)   # test set
```

Task 3: Build your 2nd classifier using Logistic Regression

```python
def build_LR_classifier(x, y):
    """
    Parameters:
        x (scipy.sparse.csr.csr_matrix): document features
        y (list): a list of document labels
    """
    # Implement your function here
    pass

lr_clf = build_LR_classifier(train_x_feats, train_y)
```

Task 4: Build your model evaluator using F1-score

```python
def f1_evaluator(x, y_truth, clf):
    """
    Parameters:
        x (scipy.sparse.csr.csr_matrix): document features
        y_truth (list): a list of original document labels
        clf (object): a well-trained text classifier
    """
    result = 0.0
    # Implement your function here
    print(result)
    return result
```

```python
## Example code for evaluating models by accuracy
# dev_y_pred = nb_clf.predict(dev_x_feats)
# print("Accuracy:", sum(dev_y_pred == dev_y) / len(dev_y))
```

    Accuracy: 0.541
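The cells above leave the vectorizer, the two classifiers, and the evaluator unimplemented. Below is a minimal sketch of one way Tasks 1-4 could be filled in; the choice of `MultinomialNB`, `LogisticRegression(max_iter=1000)`, and `'macro'` F1 averaging are assumptions rather than requirements of the assignment, and the variable names mirror those already used in the notebook (`train_x_feats`, `dev_x_feats`, `nb_clf`, `lr_clf`).

```python
# Minimal sketch (not the official solution) for Tasks 1-4.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB          # assumed classifier choice for "Naive Bayes"
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Task 1: TF-IDF features with a 10,000-term vocabulary and unigram-to-trigram n-grams
vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1, 3))
train_x_feats = vectorizer.fit_transform(train_x)   # fit the vocabulary on training data only
dev_x_feats = vectorizer.transform(dev_x)           # reuse the fitted vocabulary for dev data

# Task 2: Naive Bayes classifier
def build_NB_classifier(x, y):
    clf = MultinomialNB()
    clf.fit(x, y)
    return clf

# Task 3: Logistic Regression classifier
def build_LR_classifier(x, y):
    clf = LogisticRegression(max_iter=1000)   # larger max_iter helps convergence on sparse TF-IDF
    clf.fit(x, y)
    return clf

nb_clf = build_NB_classifier(train_x_feats, train_y)
lr_clf = build_LR_classifier(train_x_feats, train_y)

# Task 4: F1-score evaluator
def f1_evaluator(x, y_truth, clf):
    y_pred = clf.predict(x)
    # 'macro' averaging is one reasonable choice for multi-class ratings;
    # the assignment may expect a different averaging mode.
    result = f1_score(y_truth, y_pred, average='macro')
    print(result)
    return result
```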
Task 5: Evaluate your models on the development set, and choose the best model

```python
f1_nb_score = f1_evaluator(dev_x_feats, dev_y, nb_clf)
f1_lr_score = f1_evaluator(dev_x_feats, dev_y, lr_clf)

# Implement your code here to find out the best classifier:
# if the NB classifier performs better than the LR classifier, assign the NB classifier as the best clf;
# else, assign the LR classifier as the best clf.
# You should compare the F1-scores of your models and name the best model "best_clf"
best_clf = ?
```

Task 6: Apply your best model on the test set

```python
# Implement your code here.
# First, you have to load your test data
test_x, test_y = ?
# Next, you will encode the documents into features
test_x_feats = ?
# Evaluate your test result using the features, document labels, and your best clf
your_score = ?
```
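Continuing the same sketch, Tasks 5 and 6 reduce to comparing the two development-set F1 scores and then reusing the already-fitted vectorizer on the test split. The code below assumes the names defined in the earlier sketch (`vectorizer`, `nb_clf`, `lr_clf`, `f1_evaluator`) and the `load_and_preprocess` helper from the question.

```python
# Minimal sketch (not the official solution) for Tasks 5 and 6.

# Task 5: pick whichever classifier scores higher on the development set
f1_nb_score = f1_evaluator(dev_x_feats, dev_y, nb_clf)
f1_lr_score = f1_evaluator(dev_x_feats, dev_y, lr_clf)
best_clf = nb_clf if f1_nb_score > f1_lr_score else lr_clf

# Task 6: load the test split, encode it with the already-fitted vectorizer,
# and evaluate the best classifier on it
test_x, test_y = load_and_preprocess('./data/test.tsv', test=True)
test_x_feats = vectorizer.transform(test_x)   # transform only; never refit on test data
your_score = f1_evaluator(test_x_feats, test_y, best_clf)
```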
Step by Step Solution
It seems like you're working on building and evaluating text classifiers using Naive Bayes and Logistic Regression models. Let's go through each task step by step…
