Question:
Classification is a form of data analysis in which a model, or classifier, is constructed to predict class labels. Data classification is a two-phase process: (1) a learning phase, in which the classifier is built, and (2) a classification phase, in which the class label of given data is determined. In this project, students will address problems in both phases of classification by implementing a basic algorithm to construct a classifier and by determining the class labels of given data tuples.
Problem
Our textbook describes in Section 8.2.1 a basic algorithm for inducing a decision tree from training tuples. The algorithm follows a top-down approach, similar to how decision trees are constructed by algorithms such as ID3, C4.5, and CART.
Part 1
Write a Java program that implements the basic algorithm for inducing a decision tree.
Note the following requirements:
- You are not required to fully implement the algorithm as given on pages 332–335, but only the partition that occurs at the root node, i.e., the generation of the first level below the root. Use the description in Figure 8.3 (page 333) for the algorithm steps and the illustration in Figure 8.5 as a reference for the expected output.
- Consider the splitting attribute to be discrete-valued and use information gain as the attribute selection measure.
- Your implementation should use the training tuples given in Table 8.1 (page 338 of the textbook). When run, the program should output the information given in Figure 8.5 (the output does not need to look exactly as shown there, but must convey the same meaning). A minimal illustrative sketch of this computation is given after this list.
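As a reference point for Part 1 only, the following is a minimal, hypothetical sketch (not the required design): it hard-codes the Table 8.1 tuples, computes the information gain of each candidate attribute, and prints the resulting first-level partition. The class name InfoGainRootSplit and all method names are assumptions made for illustration.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch: information-gain split at the root node over the Table 8.1 tuples. */
public class InfoGainRootSplit {

    // Column order: age, income, student, credit_rating; the last cell is the class label.
    static final String[] ATTRIBUTES = {"age", "income", "student", "credit_rating"};

    static final String[][] DATA = {
        {"youth", "high", "no", "fair", "no"},
        {"youth", "high", "no", "excellent", "no"},
        {"middle_aged", "high", "no", "fair", "yes"},
        {"senior", "medium", "no", "fair", "yes"},
        {"senior", "low", "yes", "fair", "yes"},
        {"senior", "low", "yes", "excellent", "no"},
        {"middle_aged", "low", "yes", "excellent", "yes"},
        {"youth", "medium", "no", "fair", "no"},
        {"youth", "low", "yes", "fair", "yes"},
        {"senior", "medium", "yes", "fair", "yes"},
        {"youth", "medium", "yes", "excellent", "yes"},
        {"middle_aged", "medium", "no", "excellent", "yes"},
        {"middle_aged", "high", "yes", "fair", "yes"},
        {"senior", "medium", "no", "excellent", "no"}
    };

    // Expected information (entropy) needed to classify a tuple in the partition: Info(D).
    static double entropy(List<String[]> partition) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] t : partition) counts.merge(t[t.length - 1], 1, Integer::sum);
        double info = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / partition.size();
            info -= p * Math.log(p) / Math.log(2);
        }
        return info;
    }

    // Partitions the tuples on the values of the attribute at the given column index.
    static Map<String, List<String[]>> split(List<String[]> partition, int attr) {
        Map<String, List<String[]>> branches = new LinkedHashMap<>();
        for (String[] t : partition) branches.computeIfAbsent(t[attr], k -> new ArrayList<>()).add(t);
        return branches;
    }

    // Gain(A) = Info(D) - Info_A(D), where Info_A(D) is the weighted entropy of the branches.
    static double gain(List<String[]> partition, int attr) {
        double infoA = 0.0;
        for (List<String[]> branch : split(partition, attr).values()) {
            infoA += ((double) branch.size() / partition.size()) * entropy(branch);
        }
        return entropy(partition) - infoA;
    }

    public static void main(String[] args) {
        List<String[]> d = Arrays.asList(DATA);
        int best = 0;
        for (int i = 0; i < ATTRIBUTES.length; i++) {
            System.out.printf("Gain(%s) = %.3f%n", ATTRIBUTES[i], gain(d, i));
            if (gain(d, i) > gain(d, best)) best = i;
        }
        final int bestAttr = best;
        System.out.println("Splitting attribute at the root: " + ATTRIBUTES[bestAttr]);

        // Print the first-level partition, in the spirit of Figure 8.5.
        split(d, bestAttr).forEach((value, branch) -> {
            System.out.println(ATTRIBUTES[bestAttr] + " = " + value + ":");
            for (String[] t : branch) System.out.println("  " + String.join(", ", t));
        });
    }
}

On the Table 8.1 data this reports age as the splitting attribute and groups the tuples under youth, middle_aged, and senior, which is the structure Figure 8.5 illustrates.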
Part 2
Test the implementation from Part 1 by asking the user to enter a tuple and letting the model determine the class label for that tuple.
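By way of illustration only, a Part 2 interaction could look like the sketch below. It assumes the Part 1 split on age is available and falls back to majority voting within each branch; the class name ClassifyTuple, the prompts, and the hard-coded majority labels are assumptions, not a prescribed design.

import java.util.Map;
import java.util.Scanner;

/** Illustrative sketch: classify a user-entered tuple with the one-level split on age. */
public class ClassifyTuple {

    public static void main(String[] args) {
        // Assumed output of the Part 1 model: the majority class within each root branch
        // (youth: 3 of 5 tuples are "no"; middle_aged: all 4 are "yes"; senior: 3 of 5 are "yes").
        Map<String, String> majorityClassByAge = Map.of(
                "youth", "no",
                "middle_aged", "yes",
                "senior", "yes");

        Scanner in = new Scanner(System.in);
        System.out.print("Enter age (youth / middle_aged / senior): ");
        String age = in.nextLine().trim();
        System.out.print("Enter income, student, credit_rating (ignored by a root-only split): ");
        in.nextLine();  // read and discard the rest of the tuple; a full tree would use it

        System.out.println("Predicted class (buys_computer): "
                + majorityClassByAge.getOrDefault(age, "unknown"));
    }
}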
REFERENCES
8.2.1 Decision Tree Induction
During the late 1970s and early 1980s, J. Ross Quinlan, a researcher in machine learning, developed a decision tree algorithm known as ID3 (Iterative Dichotomiser). This work expanded on earlier work on concept learning systems, described by E. B. Hunt, J. Marin, and P. T. Stone. Quinlan later presented C4.5 (a successor of ID3), which became a benchmark to which newer supervised learning algorithms are often compared. In 1984, a group of statisticians (L. Breiman, J. Friedman, R. Olshen, and C. Stone) published the book Classification and Regression Trees (CART), which described the generation of binary decision trees. ID3 and CART were invented independently of one another at around the same time, yet follow a similar approach for learning decision trees from training tuples. These two cornerstone algorithms spawned a flurry of work on decision tree induction.

ID3, C4.5, and CART adopt a greedy (i.e., nonbacktracking) approach in which decision trees are constructed in a top-down recursive divide-and-conquer manner. Most algorithms for decision tree induction also follow a top-down approach, which starts with a training set of tuples and their associated class labels. The training set is recursively partitioned into smaller subsets as the tree is being built. A basic decision tree algorithm is summarized in Figure 8.3. At first glance, the algorithm may appear long, but fear not! It is quite straightforward. The strategy is as follows.

The algorithm is called with three parameters: D, attribute_list, and Attribute_selection_method. We refer to D as a data partition. Initially, it is the complete set of training tuples and their associated class labels. The parameter attribute_list is a list of attributes describing the tuples. Attribute_selection_method specifies a heuristic procedure for selecting the attribute that "best" discriminates the given tuples according to class. This procedure employs an attribute selection measure such as information gain or the Gini index. Whether the tree is strictly binary is generally driven by the attribute selection measure. Some attribute selection measures, such as the Gini index, enforce the resulting tree to be binary. Others, like information gain, do not, therein allowing multiway splits (i.e., two or more branches to be grown from a node). The tree starts as a single node, N, representing the training tuples in D (step 1).
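To make the strategy just described concrete, here is a rough Java rendering of the recursive procedure. It is only a sketch under assumed types and names (Node, the column layout of the tuples, and the pluggable selector standing in for Attribute_selection_method); the step numbers in the comments refer to Figure 8.3 below.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

/** Sketch of the top-down, divide-and-conquer strategy; types and names are illustrative only. */
public class TreeSkeleton {

    static class Node {
        String label;                                        // class label (leaf) or splitting attribute
        Map<String, Node> children = new LinkedHashMap<>();  // outcome of the split -> subtree
    }

    static final String[] ATTR_NAMES = {"age", "income", "student", "credit_rating"};

    // d: tuples whose last cell is the class label; attrs: column indexes of candidate attributes;
    // selector: stands in for Attribute_selection_method (e.g., information gain or Gini index).
    static Node generate(List<String[]> d, List<Integer> attrs,
                         BiFunction<List<String[]>, List<Integer>, Integer> selector) {
        Node n = new Node();                                               // step (1)
        Map<String, Integer> classes = classCounts(d);
        if (classes.size() == 1) {                                         // steps (2)-(3)
            n.label = classes.keySet().iterator().next();
            return n;
        }
        if (attrs.isEmpty()) {                                             // steps (4)-(5)
            n.label = majorityClass(classes);
            return n;
        }
        int split = selector.apply(d, attrs);                              // step (6)
        n.label = ATTR_NAMES[split];                                       // step (7)
        List<Integer> remaining = new ArrayList<>(attrs);
        remaining.remove(Integer.valueOf(split));                          // steps (8)-(9), discrete case

        Map<String, List<String[]>> branches = new LinkedHashMap<>();      // step (10)
        for (String[] t : d) {
            branches.computeIfAbsent(t[split], k -> new ArrayList<>()).add(t);
        }
        // Steps (11)-(14): branches built directly from D are never empty, so the
        // empty-partition case of steps (12)-(13) does not arise in this simplified sketch.
        for (Map.Entry<String, List<String[]>> e : branches.entrySet()) {
            n.children.put(e.getKey(), generate(e.getValue(), remaining, selector));
        }
        return n;                                                          // step (15)
    }

    static Map<String, Integer> classCounts(List<String[]> d) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] t : d) counts.merge(t[t.length - 1], 1, Integer::sum);
        return counts;
    }

    static String majorityClass(Map<String, Integer> counts) {
        return counts.entrySet().stream()
                .max(Map.Entry.comparingByValue()).get().getKey();
    }
}

A selector based on information gain, such as the gain method in the Part 1 sketch above, could be passed as the third argument; the assignment itself only requires the split at the root.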


Table 8.1  Class-Labeled Training Tuples from the AllElectronics Customer Database

RID  age          income  student  credit_rating  Class: buys_computer
1    youth        high    no       fair           no
2    youth        high    no       excellent      no
3    middle_aged  high    no       fair           yes
4    senior       medium  no       fair           yes
5    senior       low     yes      fair           yes
6    senior       low     yes      excellent      no
7    middle_aged  low     yes      excellent      yes
8    youth        medium  no       fair           no
9    youth        low     yes      fair           yes
10   senior       medium  yes      fair           yes
11   youth        medium  yes      excellent      yes
12   middle_aged  medium  no       excellent      yes
13   middle_aged  high    yes      fair           yes
14   senior       medium  no       excellent      no

Algorithm: Generate_decision_tree. Generate a decision tree from the training tuples of data partition, D.

Input:
- Data partition, D, which is a set of training tuples and their associated class labels;
- attribute_list, the set of candidate attributes;
- Attribute_selection_method, a procedure to determine the splitting criterion that "best" partitions the data tuples into individual classes. This criterion consists of a splitting_attribute and, possibly, either a split-point or splitting subset.

Output: A decision tree.

Method:
(1)  create a node N;
(2)  if tuples in D are all of the same class, C, then
(3)      return N as a leaf node labeled with the class C;
(4)  if attribute_list is empty then
(5)      return N as a leaf node labeled with the majority class in D; // majority voting
(6)  apply Attribute_selection_method(D, attribute_list) to find the "best" splitting_criterion;
(7)  label node N with splitting_criterion;
(8)  if splitting_attribute is discrete-valued and multiway splits allowed then // not restricted to binary trees
(9)      attribute_list ← attribute_list − splitting_attribute; // remove splitting_attribute
(10) for each outcome j of splitting_criterion // partition the tuples and grow subtrees for each partition
(11)     let Dj be the set of data tuples in D satisfying outcome j; // a partition
(12)     if Dj is empty then
(13)         attach a leaf labeled with the majority class in D to node N;
(14)     else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
     endfor
(15) return N;

Figure 8.3  Basic algorithm for inducing a decision tree from training tuples.

age = youth:
income  student  credit_rating  class
high    no       fair           no
high    no       excellent      no
medium  no       fair           no
low     yes      fair           yes
medium  yes      excellent      yes

age = middle_aged:
income  student  credit_rating  class
high    no       fair           yes
low     yes      excellent      yes
medium  no       excellent      yes
high    yes      fair           yes

age = senior:
income  student  credit_rating  class
medium  no       fair           yes
low     yes      fair           yes
low     yes      excellent      no
medium  yes      fair           yes
medium  no       excellent      no

Figure 8.5  The attribute age has the highest information gain and therefore becomes the splitting attribute at the root node of the decision tree. Branches are grown for each outcome of age. The tuples are shown partitioned accordingly.
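As a sanity check for Part 1 (these values follow from the standard entropy calculation and match the textbook's worked example): the expected information is Info(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940 bits, and the gains over Table 8.1 come out to approximately Gain(age) ≈ 0.246, Gain(student) ≈ 0.151, Gain(credit_rating) ≈ 0.048, and Gain(income) ≈ 0.029 bits, which is why age is selected at the root. A correct Part 1 implementation should reproduce values close to these.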
