Feature Selection
We will evaluate two feature selection methods, the chi-squared method and the mutual information method, to find out how they perform on this dataset. The following example shows how to use them in sklearn.
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, mutual_info_classif

X = feature_vectors
y = targets

# Keep the 100 highest-scoring features under each criterion.
X_new1 = SelectKBest(chi2, k=100).fit_transform(X, y)
X_new2 = SelectKBest(mutual_info_classif, k=100).fit_transform(X, y)
The choice of K depends on the actual number of features available; for this dataset, a reasonable range is a few hundred to a few thousand. Test both methods across several values of K with the four classifiers to determine whether feature selection helps for this dataset, and report your results as figures (x-axis: K, y-axis: f1_macro). Discuss what you observe. Put the code in "feature_selection.py".
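One way the experiment could be structured is a loop over selection methods and K values, scoring each combination with cross-validated f1_macro. The sketch below is illustrative: it uses a synthetic dataset and a single MultinomialNB classifier standing in for your loaded data and four classifiers, and the K values are scaled down to match the toy data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Synthetic stand-in; for the assignment, load X, y with load_svmlight_file.
X, y = make_classification(n_samples=200, n_features=100, n_informative=10,
                           random_state=0)
X = np.abs(X)  # chi2 requires non-negative values (TF counts satisfy this)

results = {}
for name, score_func in [("chi2", chi2), ("mutual_info", mutual_info_classif)]:
    for k in [10, 30]:  # use a few hundred to a few thousand on the real data
        # Putting SelectKBest inside a pipeline refits the selection on each
        # CV training fold, avoiding information leakage from the test folds.
        pipe = make_pipeline(SelectKBest(score_func, k=k), MultinomialNB())
        results[(name, k)] = cross_val_score(pipe, X, y, cv=3,
                                             scoring="f1_macro").mean()

for (name, k), score in sorted(results.items()):
    print(f"{name:12s} k={k:4d}  f1_macro={score:.3f}")
```

For the report, the collected results can be plotted with matplotlib, one curve per method and classifier, with K on the x-axis and f1_macro on the y-axis.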
The training_data_file is in the libsvm format, i.e., a sparse representation of the document-term matrix (see a sample libsvm file). It is a text file where each row represents one document and takes the form <class-label> <feature-id>:<feature-value> <feature-id>:<feature-value> .... This format is especially suitable for sparse datasets: zero-valued features do not need to be recorded. Based on the newsgroup directory a document belongs to, you can find its class label from the mapping defined in the class_definition_file. Each feature-id refers to the term defined in the feature_definition_file. Depending on the type of values you would like, the feature value can be term frequency (TF), inverse document frequency (IDF), TFIDF, or Boolean (for Bernoulli Naive Bayes). You can generate all training data files at once by running feature-extract.py, e.g., training_data_file.TF, training_data_file.IDF, training_data_file.TFIDF, and training_data_file.Boolean. Please carefully test the feature values you generate. Your classification and clustering experiments may select different training data; for example, Bernoulli Naive Bayes classifiers may prefer the Boolean training data.
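To see concretely what the libsvm rows look like, sklearn's dump_svmlight_file can serialize a toy document-term matrix. The two documents, four terms, and labels below are made-up values for illustration only.

```python
import io
import numpy as np
from sklearn.datasets import dump_svmlight_file

# Toy document-term matrix: 2 documents x 4 terms, values are TF counts.
X = np.array([[2, 0, 1, 0],
              [0, 3, 0, 1]])
y = np.array([0, 1])  # class labels, as defined in the class_definition_file

buf = io.BytesIO()
dump_svmlight_file(X, y, buf, zero_based=True)
print(buf.getvalue().decode())  # zero-valued entries are simply omitted
```

Each printed row starts with the class label followed by feature-id:feature-value pairs for the non-zero entries only, which is exactly why the format stays compact for sparse data.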
Once you have the training data in the libsvm format, you can easily load it into sklearn as follows.
from sklearn.datasets import load_svmlight_file
feature_vectors, targets = load_svmlight_file("/path/to/train_dataset.txt")
Data pre-processing has been one of the most time-consuming and critical steps in data mining. Make sure you thoroughly test your code.
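A few cheap sanity checks after loading can catch pre-processing bugs early. The sketch below writes a tiny two-document sample file standing in for a real training_data_file.TF (the file name and contents are made up); the same checks are worth repeating on your generated data.

```python
import numpy as np
from sklearn.datasets import load_svmlight_file

# Tiny sample standing in for a generated training_data_file.TF.
sample_path = "sample_train.TF"
with open(sample_path, "w") as f:
    f.write("0 1:2 3:1\n"
            "1 2:3 4:1\n")

feature_vectors, targets = load_svmlight_file(sample_path)

# One label per document, and TF values must be non-negative.
assert feature_vectors.shape[0] == targets.shape[0]
assert feature_vectors.min() >= 0
print("shape:", feature_vectors.shape)
print("class counts:", np.bincount(targets.astype(int)))
```

Checking the matrix shape against the number of documents, the class distribution against the class_definition_file, and the value ranges against the intended weighting (TF, IDF, TFIDF, or Boolean) takes a few lines and saves hours of debugging downstream.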
