Question: PLEASE HELP ME IN C++ In this lab, you will implement part of a naive Bayes spam classifier. To illustrate how this filter works,
PLEASE HELP ME IN C++
In this lab, you will implement part of a naive Bayes spam classifier. To illustrate how this filter works, consider the following email:

Hey! This is the best link I found. I thought you would want to see it!
www.somelink.com/example
Best,
Sus

We want to classify this email as either spam or not spam. Typically, the filter will consider the entire email and look for multiple words that are common in spam emails. For our filter, we will consider a single word. For this example, we will classify the email based on the word "best".

Assume the probability that any particular email is spam is 0.25, and the probability that any particular email is not spam is 0.75. To classify the mystery email (above), we want to compute the probability that this email is spam given that it contains the word "best". Then we want to compute the probability that this email is NOT spam given that it contains the word "best". We then classify based on which probability is higher.

Let's define a couple of variables:
1. $C$: email contains the word "best"
2. $\bar{C}$: email does NOT contain the word "best"
3. $S$: email is spam
4. $\bar{S}$: email is NOT spam

Hence, we want to compute $P(S \mid C)$ and $P(\bar{S} \mid C)$.

Computing the Probability of "best"

First, we need to figure out how common "best" is in spam emails and how common "best" is in emails that are not spam. To do this, we have to use sample emails. This is called training data. For this example, we'll use the following emails.

These are the sample spam emails:
1. you've been selected as a winner! click now to get the best anti-virus scanner!
2. you're a winner! reply immediately to claim your access to the best weight loss system ever.
These are the sample not spam emails:
1. hi, thank you so much! you're the best!
2. hi, are you going to be here on friday?

Since we have two categories, we can only compute conditional probabilities. Those probabilities are $P(C \mid S)$ and $P(C \mid \bar{S})$, the probability of "best" given a spam message and the probability of "best" given a non-spam email, respectively.

There are two spam emails with a total of 30 words. Of the 30 words in the spam messages, 2 of those words are "best". Hence, $P(C \mid S) = \frac{2}{30}$. Similarly, there are two emails that are not spam with a total of 17 words. The word "best" appears only once across the non-spam emails. Hence, $P(C \mid \bar{S}) = \frac{1}{17}$.

Using the Law of Total Probability,

$$P(C) = P(C \mid S)\,P(S) + P(C \mid \bar{S})\,P(\bar{S})$$
$$P(C) = \frac{2}{30}(0.25) + \frac{1}{17}(0.75)$$
$$P(C) = 0.0608$$

Throwing Bayes' Theorem In

To recap, we want $P(S \mid C)$ and $P(\bar{S} \mid C)$. We also have

$$P(S) = 0.25$$
$$P(C \mid S) = \frac{2}{30} = 0.0667$$
$$P(C \mid \bar{S}) = \frac{1}{17} = 0.0588$$
$$P(C) = 0.0608$$

Bayes' Theorem tells us that

$$P(S \mid C) = \frac{P(C \mid S)\,P(S)}{P(C)} = \frac{0.0667 \times 0.25}{0.0608} = 0.2743$$
$$P(\bar{S} \mid C) = \frac{P(C \mid \bar{S})\,P(\bar{S})}{P(C)} = \frac{0.0588 \times 0.75}{0.0608} = 0.7253$$

Thus, we would conclude that this email is not spam since 0.7253 is greater than 0.2743.
Program Specifications
1. Ask for a word contained in the mystery email.
2. Using the sample emails that are provided, compute and display the following probabilities:
   - the probability that the word occurs given the email is spam ($P(C \mid S)$)
   - the probability that the word occurs given the email is NOT spam ($P(C \mid \bar{S})$)
   - the probability that the mystery email is spam
   - the probability that the mystery email is NOT spam
3. Print out the email's classification: Spam or Not Spam
Some Details
- The sample email files provided are in all lowercase. You don't have to account for differences in capitalization.
- Your program should work for any five files named email1.txt, email2.txt, ... and spam1.txt, spam2.txt, ...
- Assume the probability of spam mail is 0.25, and the probability an email is not spam is 0.75.
- You should remove the following punctuation to accurately count words: periods, commas, exclamation points (!), and question marks (?). You can leave any other punctuation alone.
- It is possible for a particular word to have a probability of zero. We don't want that. To help alleviate this issue, you should add one to all of the counts. For example, going back to the example above, the probabilities would be $\frac{3}{31}$ and $\frac{2}{18}$.
- Don't worry about rounding. (But if it bothers you, round to 4 decimal places at the very end.)
Example Output
Here is an example of what your program's output could look like. Yours doesn't have to look exactly like this, as long as it prints out the right things. User input is in blue in the original handout; here it is the word that follows the prompt.

This example uses the files from smallTestFiles:

--------------------------------
Title Goes Here
--------------------------------
Purpose Goes Here
--------------------------------
Which word is contained in the mystery email? winner
Probability of a spam email containing the word winner: 0.0967742
Probability of a non-spam email containing the word winner: 0.0555556
Probability the email is spam: 0.367347
Probability the email is not spam: 0.632653
Your message is not spam!

This example uses the files from testFiles:

--------------------------------
Title Goes Here
--------------------------------
Purpose Goes Here
--------------------------------
Which word is contained in the mystery email? diet
Probability of a spam email containing the word diet: 0.0023948
Probability of a non-spam email containing the word diet: 0.00128535
Probability the email is spam: 0.383116
Probability the email is not spam: 0.616884
Your message is not spam!
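As a sanity check on the first example, the sketch below builds the four probability lines from raw counts using add-one counts and Bayes' Theorem. It assumes that smallTestFiles contains the four sample emails from the walkthrough (30 spam words, 17 non-spam words), which the printed values are consistent with. The function name `buildReport`, the "Your message is spam!" wording, and the choice to treat a tie as not spam are all assumptions, since the handout only shows the not-spam case.

```cpp
#include <sstream>
#include <string>

// Build the probability lines and the verdict from raw word counts.
// Hypothetical helper; not part of the lab handout.
std::string buildReport(const std::string& word,
                        int spamMatches, int spamTotal,
                        int hamMatches, int hamTotal,
                        double priorSpam = 0.25) {
    // Add-one counts: (count + 1) / (total + 1).
    const double pWordSpam = static_cast<double>(spamMatches + 1) / (spamTotal + 1);
    const double pWordHam = static_cast<double>(hamMatches + 1) / (hamTotal + 1);

    // Bayes' Theorem, with P(C) from the Law of Total Probability.
    const double jointSpam = pWordSpam * priorSpam;
    const double jointHam = pWordHam * (1.0 - priorSpam);
    const double pSpam = jointSpam / (jointSpam + jointHam);
    const double pHam = jointHam / (jointSpam + jointHam);

    std::ostringstream out;  // default formatting: 6 significant digits
    out << "Probability of a spam email containing the word " << word << ": "
        << pWordSpam << '\n'
        << "Probability of a non-spam email containing the word " << word << ": "
        << pWordHam << '\n'
        << "Probability the email is spam: " << pSpam << '\n'
        << "Probability the email is not spam: " << pHam << '\n'
        // Assumption: a tie is reported as not spam.
        << (pSpam > pHam ? "Your message is spam!" : "Your message is not spam!")
        << '\n';
    return out.str();
}
```

With "winner" appearing 2 times in the 30 spam words and 0 times in the 17 non-spam words, `buildReport("winner", 2, 30, 0, 17)` yields 3/31 and 1/18 for the word probabilities and 18/49 and 31/49 for the posteriors, which print as the values shown in the first example.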
