Feature Selection
We will evaluate two feature selection methods, the chi-squared method and the mutual information method, to find out how they perform on this dataset. The following example shows how to use them in sklearn.
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, mutual_info_classif

X = feature_vectors
y = targets

# Keep the 100 highest-scoring features under each criterion.
X_new1 = SelectKBest(chi2, k=100).fit_transform(X, y)
X_new2 = SelectKBest(mutual_info_classif, k=100).fit_transform(X, y)
The choice of K depends on the actual number of features available; for this dataset, a reasonable range is a few hundred to a few thousand. Test both methods across several values of K with the four classifiers to determine whether feature selection helps for this dataset, and report your results as figures (x-axis: K, y-axis: f1_macro). Discuss what you observe. Put the code in "feature_selection.py".
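One way the experiment could be structured is a loop over selection methods and K values, scoring each combination with cross-validated f1_macro. The sketch below is illustrative: it uses a synthetic dataset and a single MultinomialNB classifier standing in for your loaded data and four classifiers, and the K values are scaled down to match the toy data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Synthetic stand-in; for the assignment, load X, y with load_svmlight_file.
X, y = make_classification(n_samples=200, n_features=100, n_informative=10,
                           random_state=0)
X = np.abs(X)  # chi2 requires non-negative values (TF counts satisfy this)

results = {}
for name, score_func in [("chi2", chi2), ("mutual_info", mutual_info_classif)]:
    for k in [10, 30]:  # use a few hundred to a few thousand on the real data
        # Putting SelectKBest inside a pipeline refits the selection on each
        # CV training fold, avoiding information leakage from the test folds.
        pipe = make_pipeline(SelectKBest(score_func, k=k), MultinomialNB())
        results[(name, k)] = cross_val_score(pipe, X, y, cv=3,
                                             scoring="f1_macro").mean()

for (name, k), score in sorted(results.items()):
    print(f"{name:12s} k={k:4d}  f1_macro={score:.3f}")
```

For the report, the collected results can be plotted with matplotlib, one curve per method and classifier, with K on the x-axis and f1_macro on the y-axis.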
The training_data_file is in the libsvm format, i.e., a sparse representation of the document-term matrix (see a sample libsvm file). It is a text file where each row represents one document and takes the form <class-label> <feature-id>:<feature-value> <feature-id>:<feature-value> .... This format is especially suitable for sparse datasets: zero-valued features do not need to be recorded. Based on the newsgroup directory a document belongs to, you can find its class label from the mapping defined in the class_definition_file. Each feature-id refers to the term defined in the feature_definition_file. Depending on the type of values you would like, the feature value can be term frequency (TF), inverse document frequency (IDF), TFIDF, or Boolean (for Bernoulli Naive Bayes). You can generate all training data files at once by running feature-extract.py, e.g., training_data_file.TF, training_data_file.IDF, training_data_file.TFIDF, and training_data_file.Boolean. Please carefully test the feature values you generate. Your classification and clustering experiments may select different training data; for example, Bernoulli Naive Bayes classifiers may prefer the Boolean training data.
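To see concretely what the libsvm rows look like, sklearn's dump_svmlight_file can serialize a toy document-term matrix. The two documents, four terms, and labels below are made-up values for illustration only.

```python
import io
import numpy as np
from sklearn.datasets import dump_svmlight_file

# Toy document-term matrix: 2 documents x 4 terms, values are TF counts.
X = np.array([[2, 0, 1, 0],
              [0, 3, 0, 1]])
y = np.array([0, 1])  # class labels, as defined in the class_definition_file

buf = io.BytesIO()
dump_svmlight_file(X, y, buf, zero_based=True)
print(buf.getvalue().decode())  # zero-valued entries are simply omitted
```

Each printed row starts with the class label followed by feature-id:feature-value pairs for the non-zero entries only, which is exactly why the format stays compact for sparse data.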
Once you have the training data in the libsvm format, you can easily load it into sklearn as follows.
from sklearn.datasets import load_svmlight_file
feature_vectors, targets = load_svmlight_file("/path/to/train_dataset.txt")
Data pre-processing has been one of the most time-consuming and critical steps in data mining. Make sure you thoroughly test your code.
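A few cheap sanity checks after loading can catch pre-processing bugs early. The sketch below writes a tiny two-document sample file standing in for a real training_data_file.TF (the file name and contents are made up); the same checks are worth repeating on your generated data.

```python
import numpy as np
from sklearn.datasets import load_svmlight_file

# Tiny sample standing in for a generated training_data_file.TF.
sample_path = "sample_train.TF"
with open(sample_path, "w") as f:
    f.write("0 1:2 3:1\n"
            "1 2:3 4:1\n")

feature_vectors, targets = load_svmlight_file(sample_path)

# One label per document, and TF values must be non-negative.
assert feature_vectors.shape[0] == targets.shape[0]
assert feature_vectors.min() >= 0
print("shape:", feature_vectors.shape)
print("class counts:", np.bincount(targets.astype(int)))
```

Checking the matrix shape against the number of documents, the class distribution against the class_definition_file, and the value ranges against the intended weighting (TF, IDF, TFIDF, or Boolean) takes a few lines and saves hours of debugging downstream.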
