Question: PLEASE HELP ME IN C++ In this lab, you will implement part of a naive Bayes spam classifier. To illustrate how this filter works,

PLEASE HELP ME IN C++

In this lab, you will implement part of a naive Bayes spam classifier. To illustrate how this filter works, consider the following email:

    Hey! This is the best link I found. I thought you would want to see it!
    www.somelink.com/example
    Best, Sus

We want to classify this email as either spam or not spam. Typically, the filter will consider the entire email and look for multiple words that are common in spam emails. For our filter, we will consider a single word. For this example, we will classify the email based on the word "best".

Assume the probability that any particular email is spam is 0.25, and the probability that any particular email is not spam is 0.75. To classify the mystery email (above), we want to compute the probability that this email is spam given that it contains the word "best". Then we want to compute the probability that this email is NOT spam given that it contains the word "best". We then classify based on which probability is higher.

Let's define a few variables:

1. $C$: email contains the word "best"
2. $\bar{C}$: email does NOT contain the word "best"
3. $S$: email is spam
4. $\bar{S}$: email is NOT spam

Hence, we want to compute $P(S \mid C)$ and $P(\bar{S} \mid C)$.

Computing the Probability of "best"

First, we need to figure out how common "best" is in spam emails and how common "best" is in emails that are not spam. To do this, we have to use sample emails. This is called training data. For this example, we'll use the following emails.

These are the sample spam emails:

1. you've been selected as a winner! click now to get the best anti-virus scanner!
2. you're a winner! reply immediately to claim your access to the best weight loss system ever.

These are the sample not spam emails:

1. hi, thank you so much! you're the best!
2. hi, are you going to be here on friday?

Since we have two categories, we can only compute conditional probabilities. Those probabilities are $P(C \mid S)$ and $P(C \mid \bar{S})$, the probability of "best" given a spam email and the probability of "best" given a non-spam email, respectively.

There are two spam emails with a total of 30 words. Of the 30 words in the spam messages, 2 are "best". Hence,

$$P(C \mid S) = \frac{2}{30}$$

Similarly, there are two emails that are not spam with a total of 17 words. The word "best" appears only once across the non-spam emails. Hence,

$$P(C \mid \bar{S}) = \frac{1}{17}$$

Using the Law of Total Probability,

$$P(C) = P(C \mid S)P(S) + P(C \mid \bar{S})P(\bar{S})$$
$$P(C) = \frac{2}{30}(0.25) + \frac{1}{17}(0.75)$$
$$P(C) = 0.0608$$

Throwing Bayes' Theorem In

To recap, we want $P(S \mid C)$ and $P(\bar{S} \mid C)$. We also have

$$P(S) = 0.25$$
$$P(C \mid S) = \frac{2}{30} = 0.0667$$
$$P(C \mid \bar{S}) = \frac{1}{17} = 0.0588$$
$$P(C) = 0.0608$$

Bayes' Theorem tells us that

$$P(S \mid C) = \frac{P(C \mid S)P(S)}{P(C)} = \frac{0.0667 \times 0.25}{0.0608} = 0.2743$$
$$P(\bar{S} \mid C) = \frac{P(C \mid \bar{S})P(\bar{S})}{P(C)} = \frac{0.0588 \times 0.75}{0.0608} = 0.7253$$

These two results use the rounded intermediate values, which is why they do not sum exactly to 1; carrying the unrounded fractions through gives about 0.2742 and 0.7258.

Thus, we would conclude that this email is not spam since 0.7253 is greater than 0.2743.
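The two-step calculation above (Law of Total Probability, then Bayes' Theorem) can be sketched in C++ as follows. This is only an illustration, not the assignment's required structure: the names `Posterior` and `classifyByWord` are hypothetical, and the sketch assumes $P(\bar{S}) = 1 - P(S)$.

```cpp
#include <cassert>

// Posterior probabilities for the two classes, given that the word was seen.
struct Posterior {
    double spam;     // P(S | C)
    double notSpam;  // P(not S | C)
};

// pWordGivenSpam    = P(C | S)
// pWordGivenNotSpam = P(C | not S)
// pSpam             = P(S); P(not S) is assumed to be 1 - P(S)
Posterior classifyByWord(double pWordGivenSpam, double pWordGivenNotSpam, double pSpam) {
    const double pNotSpam = 1.0 - pSpam;

    // Law of Total Probability: P(C) = P(C|S)P(S) + P(C|not S)P(not S)
    const double pWord = pWordGivenSpam * pSpam + pWordGivenNotSpam * pNotSpam;
    assert(pWord > 0.0);  // Bayes' Theorem divides by P(C)

    // Bayes' Theorem for each class
    return {pWordGivenSpam * pSpam / pWord, pWordGivenNotSpam * pNotSpam / pWord};
}
```

Calling `classifyByWord(2.0 / 30.0, 1.0 / 17.0, 0.25)` reproduces the worked example without intermediate rounding, so the two posteriors come out near 0.2742 and 0.7258 and sum to 1.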

Program Specifications

1. Ask for a word contained in the mystery email.
2. Using the sample emails that are provided, compute and display the following probabilities:
   - the probability that the word occurs given the email is spam, $P(C \mid S)$
   - the probability that the word occurs given the email is NOT spam, $P(C \mid \bar{S})$
   - the probability that the mystery email is spam
   - the probability that the mystery email is NOT spam
3. Print out the email's classification: Spam or Not Spam.

Some Details

- The sample email files provided are in all lowercase. You don't have to account for differences in capitalization.
- Your program should work for any five files named email1.txt, email2.txt, ... and spam1.txt, spam2.txt, ...
- Assume the probability of spam mail is 0.25, and the probability an email is not spam is 0.75.
- You should remove the following punctuation to accurately count words: periods, commas, exclamation points (!), and question marks (?). You can leave any other punctuation alone.
- It is possible for a particular word to have a probability of zero. We don't want that. To help alleviate this issue, you should add one to all of the counts, that is, to both the number of matches and the total number of words. For example, going back to the example above, the probabilities would be $\frac{3}{31}$ and $\frac{2}{18}$.
- Don't worry about rounding. (But if it bothers you, round to 4 decimal places at the very end.)
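The punctuation and add-one rules above could be sketched in C++ like this. It is a minimal sketch under stated assumptions: the names `WordCounts`, `stripPunctuation`, `countWords`, `countWordsInText`, and `smoothedProbability` are all hypothetical, words are taken to be whitespace-separated tokens, and a token that is empty after stripping is not counted as a word.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Running totals for one category (spam or not spam).
struct WordCounts {
    std::size_t matches = 0;  // occurrences of the target word
    std::size_t total = 0;    // all words seen
};

// Removes only periods, commas, exclamation points, and question marks.
std::string stripPunctuation(const std::string& token) {
    std::string cleaned;
    for (char ch : token) {
        if (ch != '.' && ch != ',' && ch != '!' && ch != '?') {
            cleaned += ch;
        }
    }
    return cleaned;
}

// Reads whitespace-separated tokens from any input stream (for example an
// opened std::ifstream) and adds them to the running counts.
void countWords(std::istream& in, const std::string& target, WordCounts& counts) {
    std::string token;
    while (in >> token) {
        const std::string word = stripPunctuation(token);
        if (word.empty()) continue;  // assumption: pure punctuation is not a word
        ++counts.total;
        if (word == target) ++counts.matches;
    }
}

// Convenience wrapper for counting in an in-memory string.
WordCounts countWordsInText(const std::string& text, const std::string& target) {
    std::istringstream in(text);
    WordCounts counts;
    countWords(in, target, counts);
    return counts;
}

// Add-one rule from the assignment: (matches + 1) / (total + 1).
double smoothedProbability(const WordCounts& counts) {
    return static_cast<double>(counts.matches + 1) /
           static_cast<double>(counts.total + 1);
}
```

Because `countWords` accumulates into an existing `WordCounts`, it can be called once per file (spam1.txt, spam2.txt, ...) with the same counter to build the totals for a whole category.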

Example Output

Here is an example for what your program's output could look like. Yours doesn't have to look exactly like this, as long as it prints out the right things. User input is in blue.

This example uses the files from smallTestFiles:

    --------------------------------
    Title Goes Here
    --------------------------------
    Purpose Goes Here
    --------------------------------
    Which word is contained in the mystery email? winner
    Probability of a spam email containing the word winner: 0.0967742
    Probability of a non-spam email containing the word winner: 0.0555556
    Probability the email is spam: 0.367347
    Probability the email is not spam: 0.632653
    Your message is not spam!

This example uses the files from testFiles:

    --------------------------------
    Title Goes Here
    --------------------------------
    Purpose Goes Here
    --------------------------------
    Which word is contained in the mystery email? diet
    Probability of a spam email containing the word diet: 0.0023948
    Probability of a non-spam email containing the word diet: 0.00128535
    Probability the email is spam: 0.383116
    Probability the email is not spam: 0.616884
    Your message is not spam!
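The "winner" figures are consistent with the four sample emails from the worked example (2 matches in 30 spam words, 0 matches in 17 non-spam words) once one is added to every count. The sketch below builds the four probability lines and the verdict from raw counts. The function name `buildReport` is hypothetical, the "Your message is spam!" wording and the choice to report a tie as not spam are assumptions, and the digits rely on the default stream formatting of six significant digits.

```cpp
#include <sstream>
#include <string>

// Builds the probability lines and the verdict from raw word counts.
// "ham" is used here as shorthand for "not spam".
std::string buildReport(const std::string& word,
                        long spamMatches, long spamTotal,
                        long hamMatches, long hamTotal,
                        double pSpam = 0.25) {
    // Add-one rule on both the match count and the total word count.
    const double pWordGivenSpam = static_cast<double>(spamMatches + 1) / (spamTotal + 1);
    const double pWordGivenHam = static_cast<double>(hamMatches + 1) / (hamTotal + 1);

    // Law of Total Probability, then Bayes' Theorem for each class.
    const double pHam = 1.0 - pSpam;
    const double pWord = pWordGivenSpam * pSpam + pWordGivenHam * pHam;
    const double spamPosterior = pWordGivenSpam * pSpam / pWord;
    const double hamPosterior = pWordGivenHam * pHam / pWord;

    std::ostringstream out;  // default formatting: 6 significant digits
    out << "Probability of a spam email containing the word " << word << ": "
        << pWordGivenSpam << '\n'
        << "Probability of a non-spam email containing the word " << word << ": "
        << pWordGivenHam << '\n'
        << "Probability the email is spam: " << spamPosterior << '\n'
        << "Probability the email is not spam: " << hamPosterior << '\n'
        // Assumption: a tie is reported as not spam.
        << (spamPosterior > hamPosterior ? "Your message is spam!"
                                         : "Your message is not spam!")
        << '\n';
    return out.str();
}
```

With these counts, `buildReport("winner", 2, 30, 0, 17)` yields the same four numbers as the smallTestFiles example: 3/31 and 1/18 for the conditional probabilities, then 18/49 and 31/49 for the posteriors.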
