Question: Problem 2: Decision Trees for Spam Classification (30 points)

We'll use the same data as in our earlier homework: In order to reduce my email load, I decide to implement a machine learning algorithm to decide whether or not I should read an email, or simply file it away instead. To train my model, I obtain the following data set of binary-valued features about each email, including whether I know the author or not, whether the email is long or short, and whether it has any of several key words, along with my final decision about whether to read it (y = +1 for "read", y = -1 for "discard").

[Data table. Columns: know author? | is long? | has 'research' | has 'grade' | has 'lottery' | read? Only the fragment "0 0 0 0 0 0" of the data rows survives.]

In the case of any ties where both classes have equal probability, we will prefer to predict class +1.

1. Calculate the entropy H(y) of the binary class variable y. Hint: Your answer should be a number between 0 and 1. (5 points)
2. Calculate the information gain for each feature x. Which feature should I split on for the root node of the decision tree? (10 points)
3. Determine the complete decision tree that will be learned from these data. (The tree should perfectly classify all training data.) Specify the tree by drawing it, or with a set of nested if-then-else statements. (15 points)
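The three parts can be sketched in Python as follows. This is a minimal illustration, not the expected solution. It assumes entropy is measured in bits (base-2 logarithm, consistent with the hint that H(y) lies between 0 and 1) and that the tree is grown greedily by information gain. The function names (entropy, information_gain, majority, build_tree) and the toy data are hypothetical; the toy rows are NOT the homework table, which is not reproduced here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """H(y) minus the weighted entropy of y after splitting on one feature index."""
    n = len(labels)
    remainder = 0.0
    for value in set(row[feature] for row in rows):
        subset = [y for row, y in zip(rows, labels) if row[feature] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

def majority(labels):
    """Most common class; ties go to +1, as the problem statement requires."""
    positives = sum(1 for y in labels if y == +1)
    return +1 if positives >= len(labels) - positives else -1

def build_tree(rows, labels, features):
    """Greedy tree over binary features: a leaf is +1 or -1,
    an internal node is (feature_index, {0: subtree, 1: subtree})."""
    if len(set(labels)) <= 1 or not features:
        return majority(labels)
    best = max(features, key=lambda j: information_gain(rows, labels, j))
    remaining = [j for j in features if j != best]
    children = {}
    for value in (0, 1):
        pairs = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        if not pairs:
            # No training rows reach this branch: fall back to the parent's majority.
            children[value] = majority(labels)
        else:
            children[value] = build_tree(
                [r for r, _ in pairs], [y for _, y in pairs], remaining
            )
    return (best, children)

# Hypothetical toy data (two binary features), NOT the homework table.
toy_rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
toy_labels = [-1, -1, +1, +1]

gains = [information_gain(toy_rows, toy_labels, j) for j in range(2)]
tree = build_tree(toy_rows, toy_labels, [0, 1])
```

On the toy data, feature 0 determines the label completely while feature 1 carries no information, so the root splits on feature 0. Substituting the real table for toy_rows and toy_labels (one tuple of five 0/1 feature values per email) would give the quantities the three parts ask for.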

