Question:

In this problem you will use the data and scenario described in this chapter's example, in which the task is to develop a model to classify documents as either auto-related or electronics-related.

a. From the folder autos-electronics, import the files into XLMiner using the Sample menu and the option to read the file contents into individual rows during import. Which row IDs correspond to the autos class? To the electronics class?

b. Proceed in XLMiner to the preprocessing of the text, accepting defaults except as noted. Explain what would be different if you unchecked the "perform stemming" box (but leave it checked for the remaining steps).

c. Continue to the Representation options in XLMiner, and change the maximum number of concepts to 20. Explain what is different about the Term Frequency matrix, as opposed to the TF-IDF matrix.

d. Continue to the Output Options section, without changing defaults.

i. Explain very briefly how the different matrix options differ.

ii. Restate the goal of the text mining project, and why we are not using the Concept Extraction options.

e. Going back to your notes about which rows correspond to which class of documents, add class labels to the concept-document matrix. Using this matrix, fit a predictive model (different from the model presented in the chapter illustration) to classify documents (rows) as autos or electronics. Compare its performance to that of the model presented in the chapter illustration.
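
As background for part c, the contrast between a term frequency matrix and a TF-IDF matrix can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-document corpus is made up (it is not the autos-electronics data), and the weighting used here, raw count times log(N / document frequency), is one common TF-IDF variant that may differ from the formula XLMiner applies.

```python
import math
from collections import Counter

# Hypothetical toy corpus for illustration only (not the autos-electronics data).
docs = [
    "the car engine needs a new car battery",
    "the circuit board needs a new battery",
    "the engine oil and the brake fluid",
]

# Tokenize on whitespace; a real pipeline would also stem and drop stopwords.
tokenized = [d.split() for d in docs]
vocab = sorted({t for toks in tokenized for t in toks})

# Term frequency: raw count of each term in each document.
tf = [Counter(toks) for toks in tokenized]

# Document frequency: number of documents that contain each term.
df = {t: sum(1 for counts in tf if t in counts) for t in vocab}

# Inverse document frequency (assumed variant): log(N / df).
n_docs = len(docs)
idf = {t: math.log(n_docs / df[t]) for t in vocab}

# TF-IDF: term frequency scaled by how rare the term is across documents.
tfidf = [{t: counts[t] * idf[t] for t in counts} for counts in tf]

for term in ("the", "car", "battery"):
    print(term, "TF in doc 0:", tf[0][term], "TF-IDF in doc 0:", round(tfidf[0][term], 3))
```

Under this variant, a term that occurs in every document (here "the") keeps its raw count in the term frequency matrix but is weighted down to zero in the TF-IDF matrix, while a term concentrated in one document (here "car") keeps a large weight.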
