Question:

Your task for this assignment is to implement the Apriori-based algorithm for frequent itemsets mining using input transaction data that requires cleaning.

1. Implement the Apriori algorithm originally proposed by Agrawal et al. [1] for frequent itemset mining. You are encouraged to optimize your code for the Apriori algorithm. If you optimize the algorithm, explain the technique applied. You can use any programming language that you are familiar with.

The program should be executable with 4 input parameters if the input data file is known to be unclean: the name of the input transaction data file, the name of a product code mapping file used to help clean the data, the minimum support count threshold, and the name of the output report file. The program should also be executable with 3 input parameters if the input data file is known to be clean: the name of the input transaction data file, the minimum support count threshold, and the name of the output report file. The input threshold should be an integer. An itemset is frequent if its support count is greater than or equal to the input threshold.

The program should produce a text output report file that contains all the frequent itemsets together with their support counts. The results are to be written to a text file named cs634_yourname_apriori_n.txt (where n is the minimum support count), with the following information on separate lines at the top of the output report file:

a. the ID, section and name of the course

b. your name

c. this file name

d. the program assignment due date

e. the program purpose

The frequent itemset results in the output report file should have the following format: each line contains a single frequent itemset as a list of items separated by one space. At the end of the line, the support count is printed in parentheses. For example, P01 P02 P03 (5) represents a frequent itemset containing items P01, P02 and P03 with a support count of 5.
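As a reference point for item 1, the following is a minimal sketch of a level-wise Apriori pass (join, prune, count) together with the output line format described above. It is an illustration under stated assumptions, not a submission-ready program: the function names apriori and format_itemset are hypothetical, transactions are assumed to be already clean and held in memory, and the counting step is deliberately naive rather than optimized.

```python
from collections import defaultdict
from itertools import combinations


def apriori(transactions, min_support):
    """Return {sorted item tuple: support count} for every frequent itemset.

    Hypothetical helper: an itemset is frequent when its support count
    is greater than or equal to min_support, as the assignment states.
    """
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count individual items.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset must itself be frequent.
            if all(frozenset(sub) in current
                   for sub in combinations(union, k - 1)):
                candidates.add(union)

        # Count step: naive scan of every transaction for every candidate.
        counts = defaultdict(int)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1

    return {tuple(sorted(s)): c for s, c in frequent.items()}


def format_itemset(items, support):
    """Render one report line: items separated by one space, then (support)."""
    return " ".join(items) + f" ({support})"


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = [
        ["P01", "P02", "P03"],
        ["P01", "P02"],
        ["P01", "P03"],
        ["P02", "P03"],
        ["P01", "P02", "P03"],
    ]
    result = apriori(sample, 3)
    if not result:
        print("No frequent itemsets found.")
    for items in sorted(result, key=lambda s: (len(s), s)):
        print(format_itemset(items, result[items]))
```

Sorting by itemset size and then lexicographically is a presentation choice made here; the assignment text does not prescribe an ordering.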

2. Run your program with 4 input parameters, using the unclean data file named trans_unclean.txt and minimum support values of 2, 5 and 8, producing a separate output report file for each minimum support value. If your program finds no frequent itemsets, an appropriate message should be printed.

The trans_unclean.txt data file contains examples of real-world errors and data integration issues. For this assignment, a transaction may contain invalid items, items represented by product names instead of product codes, and semicolon delimiters instead of spaces. The trans_unclean.txt data file contains 60 transactions, one transaction per line, from a set of 26 distinct items. The data file contains 15 clean transactions and 45 unclean transactions. A properly formed transaction contains one or more item codes in the form Pnn (where nn is a number from 01 through 26) separated by one space. For this assignment, transaction cleaning requires:

a. examining each input transaction;

b. replacing any ; delimiter with a space delimiter;

c. translating a product name to a product code using the mapping in data file codeprodmap.txt as a reference;

d. discarding the transaction if it contains an item code that is not one of the 26 acceptable item codes and cannot be converted to an acceptable item code;

e. saving the transaction in a temporary clean transaction output data file.

After creating the temporary clean transaction data file, evaluate this file for frequent itemsets. In the trans_unclean.txt data file, there are 15 clean transactions, 15 unclean transactions that can be acceptably cleaned and 30 unclean transactions that cannot be acceptably cleaned.

Your program should be able to run with 3 or 4 input parameters. When running with 3 input parameters, use the clean data file named trans_clean.txt and minimum support values of 2, 5 and 8, and produce a separate output text file for each minimum support value. If your program finds no frequent itemsets, an appropriate message should be printed. The trans_clean.txt data file contains 30 clean transactions. If your program runs correctly with 4 input parameters and produces an output report file for a given threshold, there is no need to run the program again with 3 input parameters for the same threshold.

3. For practice (i.e., not to be submitted for credit), run your program with 3 input parameters using the clean transaction data file named trans_practice.txt. The data file contains 100,000 transactions with an average size of 10 items from a set of 1000 distinct items. A detailed description of the data file can be found in [2].

4. Create a file named README_cs634_yourname_apriori.txt with instructions for running your program. The first lines of this file are to contain documentation as described in 1a through 1e above. Your program will be run by CS department graders using a variety of input parameters and data files. Your grade will be based on these results as well as your submitted results.

5. Your source program file should be named cs634_yourname_apriori.xxx (where xxx designates the programming language used). Your program should contain comments starting on line 1 containing information similar to 1a through 1e above. You are encouraged to add additional comments throughout the program that you feel might be helpful to the reader of your source code.
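The cleaning steps a through e in item 2 could be sketched roughly as follows. This is only an illustration under several assumptions that the assignment text does not confirm: the layout of codeprodmap.txt is not given, so the mapping is passed in as an already-built dictionary; product names are assumed to be single tokens with no embedded spaces; and name matching is assumed to be case-insensitive. The names clean_transaction and clean_lines are hypothetical.

```python
# The 26 acceptable item codes, P01 through P26.
VALID_CODES = frozenset(f"P{n:02d}" for n in range(1, 27))


def clean_transaction(line, name_to_code):
    """Return the cleaned list of item codes, or None if the line is discarded.

    name_to_code is an assumed dict of lower-cased product name -> code;
    how it is built from codeprodmap.txt depends on that file's layout.
    """
    # Step b: treat every ';' delimiter as a space, then split on whitespace.
    tokens = line.replace(";", " ").split()
    if not tokens:
        return None  # a well-formed transaction has at least one item

    cleaned = []
    for token in tokens:
        if token in VALID_CODES:
            cleaned.append(token)
            continue
        # Step c: try to translate a product name into a product code.
        code = name_to_code.get(token.lower())
        if code in VALID_CODES:
            cleaned.append(code)
        else:
            # Step d: an item that cannot be converted discards the transaction.
            return None
    return cleaned


def clean_lines(lines, name_to_code):
    """Steps a and e: examine each line and keep the cleaned ones as text."""
    kept = []
    for line in lines:
        items = clean_transaction(line, name_to_code)
        if items is not None:
            kept.append(" ".join(items))
    return kept


if __name__ == "__main__":
    # Made-up mapping and transactions for illustration only.
    mapping = {"milk": "P05", "bread": "P07"}
    raw = ["P01;P02;P03", "P01 Milk P03", "P01 P99", "P02 eggs"]
    for cleaned_line in clean_lines(raw, mapping):
        print(cleaned_line)
```

In a full program, the list returned by clean_lines would be written to the temporary clean transaction file and that file then passed to the frequent itemset step.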
