Question: Classifier Evaluation

Classifier Evaluation. For a more realistic evaluation, you would partition the data by date, train on the older data, and test on the newer data (i.e., the held-out data). This project does not require you to follow that procedure. Instead, we use cross-validation over the whole dataset to get a more reliable estimate of classifier performance: you obtain both the mean and the standard deviation of a selected metric. The following code snippet shows how to use cross-validation in evaluation.

from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, feature_vectors, targets, cv=5, scoring='f1_macro')
print("F1 macro: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

This example uses 5-fold cross-validation (cv=5), and the metric is f1_macro, the macro-average of the per-class F1 scores (why macro-averaging?).
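To make macro-averaging concrete, here is a minimal illustration (with made-up labels) of how average=None and average='macro' relate in sklearn.metrics.f1_score: macro-F1 is simply the unweighted mean of the per-class F1 scores, so a small class counts as much as the majority class.

from sklearn.metrics import f1_score

# Made-up labels for illustration only; class 2 is a small class.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]

per_class = f1_score(y_true, y_pred, average=None)  # one F1 per class
macro = f1_score(y_true, y_pred, average='macro')   # unweighted mean of the above

print("per-class F1:", per_class)
print("macro F1: %0.2f" % macro)  # equals per_class.mean()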

For each classifier, please report the mean and 2*std of the 5-fold f1_macro, precision_macro, and recall_macro scores, respectively. Hints: good classifiers for this dataset achieve an F1 score above 0.5; for SVM, normalizing feature values may help performance; for kNN, try a few different settings of n_neighbors to find the best one. Discuss what you observe and put the related code in "classification.py".
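One possible shape of the reporting loop is sketched below. It assumes feature_vectors and targets are already loaded (e.g., with load_svmlight_file, shown later); the classifier choices, the MaxAbsScaler normalization for the SVM, and the n_neighbors value are illustrative assumptions, not requirements of the assignment.

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.svm import LinearSVC

classifiers = {
    "MultinomialNB": MultinomialNB(),
    # Normalizing feature values may help the SVM; MaxAbsScaler preserves sparsity.
    "LinearSVC": make_pipeline(MaxAbsScaler(), LinearSVC()),
    "kNN(k=5)": KNeighborsClassifier(n_neighbors=5),  # try several n_neighbors values
}

for name, clf in classifiers.items():
    for metric in ("f1_macro", "precision_macro", "recall_macro"):
        scores = cross_val_score(clf, feature_vectors, targets, cv=5, scoring=metric)
        print("%s %s: %0.2f (+/- %0.2f)" % (name, metric, scores.mean(), scores.std() * 2))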


A training_data_file is in the libsvm format, i.e., a sparse representation of the document-term matrix (see a sample libsvm file). It is a text file where each row represents one document and takes the form <class-label> <feature-id>:<feature-value> <feature-id>:<feature-value> .... This format is especially suitable for sparse datasets: zero-valued features do not need to be recorded. According to the newsgroup directory the document belongs to, you can find the class label from the mapping defined in the class_definition_file. Each feature-id refers to the term defined in feature_definition_file. Depending on the type of values you want, the feature value can be term frequency (TF), inverse document frequency (IDF), TFIDF, or Boolean (for Bernoulli Naive Bayes). You can generate all training data files at once by running feature-extract.py: e.g., training_data_file.TF, training_data_file.IDF, training_data_file.TFIDF, and training_data_file.Boolean. Please test the feature values you generate carefully. Your experiments on classification and clustering may use different training data; for example, Bernoulli Naive Bayes classifiers may prefer the Boolean training data.
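If you write the training files from your own feature extractor, sklearn can emit the libsvm format for you. The sketch below is one assumed way to do it, with hypothetical toy data: X is a SciPy sparse document-term matrix and y holds the class labels from class_definition_file.

from scipy.sparse import csr_matrix
from sklearn.datasets import dump_svmlight_file

# Toy document-term matrix holding raw term frequencies (TF); real values
# would come from feature-extract.py.
X = csr_matrix([[2, 0, 1],
                [0, 3, 0]])
y = [0, 1]  # class labels from class_definition_file

# Each output row looks like: <class-label> <feature-id>:<feature-value> ...
# zero_based=False makes feature ids start at 1, as in typical libsvm files.
dump_svmlight_file(X, y, "training_data_file.TF", zero_based=False)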
Once you have the training data in the libsvm format, you can easily load it into sklearn as follows.
from sklearn.datasets import load_svmlight_file
feature_vectors, targets = load_svmlight_file("/path/to/train_dataset.txt")
Data pre-processing is one of the most time-consuming and critical steps in data mining. Make sure you test your code thoroughly.
