Question: The spam filter in the lecture had 58 input variables to determine the output variable. Most of these record either the proportion of certain words in the email ("money" = 0.01 would mean that "money" made up 1% of the words) or the proportion of certain characters in the email (such as "!" = 0.01 meaning that exclamation points made up 1% of the characters in the email). The variable TOTCAPS, the total number of capital letters in the email, is quite different. When a PCA (principal components analysis) is done with the variables NOT being normalized, the first principal component is dominated by TOTCAPS and captures 92.7% of the variance. When the variables are normalized, the first component captures only 11.6% of the variation and is not dominated by any variable. Explain why this kind of result is to be expected.
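The effect described in the question can be reproduced on toy data. PCA looks for directions of maximum variance, and variance depends on the units of each variable, so a count in the hundreds can swamp proportions near 0.01. The following is a minimal sketch using synthetic data, not the lecture's spam dataset: ten proportion-like columns and one count column standing in for TOTCAPS. The column counts, distributions, and function name are all assumptions made for illustration.

```python
import numpy as np

def pca_summary(X):
    """Return (explained-variance ratios, first principal direction) of X."""
    cov = np.cov(X, rowvar=False)           # covariance of the columns
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    ratios = eigvals[::-1] / eigvals.sum()  # reorder to descending
    return ratios, eigvecs[:, -1]           # eigenvector of the largest eigenvalue

rng = np.random.default_rng(0)
n = 2000

# Hypothetical proportion variables: small values, tiny variances.
props = rng.uniform(0.0, 0.05, size=(n, 10))
# Hypothetical stand-in for TOTCAPS: counts in the hundreds, huge variance.
totcaps = rng.gamma(shape=2.0, scale=150.0, size=(n, 1))
X = np.hstack([props, totcaps])

# PCA on the raw variables: the large-scale column dominates.
raw_ratios, raw_pc1 = pca_summary(X)

# PCA after normalizing every column to mean 0 and standard deviation 1.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
norm_ratios, norm_pc1 = pca_summary(Z)

print("raw: share of variance in PC1 =", raw_ratios[0])
print("raw: |loading| of count column on PC1 =", abs(raw_pc1[-1]))
print("normalized: share of variance in PC1 =", norm_ratios[0])
```

In this sketch the raw first component is almost entirely the count column, because that column alone accounts for nearly all of the total variance. After normalization every column contributes variance 1, so no single variable can dominate by scale alone, and the first component's share is driven only by correlations among the variables. The toy columns are independent, so the normalized share lands near 1/11. In real data, correlated variables would push it somewhat higher.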
