Question:

Consider the case of a website that caters to the needs of a specific farming community and carries classified ads intended for that community. Anyone, including robots, can post an ad via a web interface, and the site owners have problems with ads that are fraudulent, spam, or simply not relevant to the community. They have provided a file with 4143 ads, each ad in a row, and each ad labeled as either −1 (not relevant) or 1 (relevant). The goal is to develop a predictive model that can classify ads automatically.

• Open the file farm-ads.csv, and briefly review some of the relevant and non-relevant ads to get a flavor for their contents.

• Following the example in the chapter, preprocess the data in RapidMiner, and create a term–document matrix and a concept matrix. Limit the number of concepts to 20.

a. Examine the term–document matrix.

i. Is it sparse or dense?

ii. Find two non-zero entries and briefly interpret their meaning, in words (you do not need to derive their calculation).

b. Briefly explain the difference between the term–document matrix and the concept–document matrix. Relate the latter to what you learned in the principal components chapter (Chapter 4).

c. Using logistic regression, partition the data (60% training, 40% holdout), and develop a model to classify the documents as “relevant” or “non-relevant.” Comment on its efficacy.

d. Why use the concept–document matrix, and not the term–document matrix, to provide the predictor attributes?
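
To make the idea of a term–document matrix in part (a) concrete, here is a minimal Python sketch, offered as an illustration outside RapidMiner and not as the chapter's method. It assumes documents in rows, terms in columns, and raw term counts as entries; the chapter's RapidMiner process may orient or weight the matrix differently (for example, with TF-IDF). The four toy ads are hypothetical and are not taken from farm-ads.csv.

```python
from collections import Counter

# Hypothetical toy ads for illustration only; NOT taken from farm-ads.csv.
docs = [
    "tractor for sale low hours",
    "organic feed for cattle and sheep",
    "cheap watches buy now",
    "tractor parts and cattle feed",
]

# Tokenize on whitespace and collect the vocabulary (one column per term).
tokenized = [d.lower().split() for d in docs]
vocab = sorted({term for tokens in tokenized for term in tokens})

# Assumed layout: rows are documents, columns are terms, entries are raw counts.
counts = [Counter(tokens) for tokens in tokenized]
matrix = [[c[term] for term in vocab] for c in counts]

# Sparsity is the share of entries that are zero.
nonzero = sum(1 for row in matrix for value in row if value != 0)
total = len(matrix) * len(vocab)
sparsity = 1 - nonzero / total

print(f"{len(docs)} documents x {len(vocab)} terms, {sparsity:.0%} of entries are zero")

# Interpreting one entry: how often a given term occurs in a given document.
doc_index, term = 0, "tractor"
print(f"'{term}' occurs {matrix[doc_index][vocab.index(term)]} time(s) in document {doc_index}")
```

Even with only four short documents, most entries are zero because each ad uses a small part of the overall vocabulary. With thousands of ads and a much larger vocabulary the effect is stronger, which is one motivation for reducing the term columns to a small number of concepts before fitting a classifier.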
