Question:
Here is the link for the CSV file:
http://www.dewetcomputers.com/CSV/dataset_assignment3.csv
The project asks you to develop, evaluate, and compare models for the prediction of proteins that interact with nucleic acids (either DNA or RNA) using a provided dataset.

Dataset

The dataset (the "dataset_assignment3.csv" file) is provided in a text-based, comma-separated format where each protein is represented by 13 numeric features and 1 symbolic outcome. The outcome feature (called "class") annotates each protein as Y (interacting) vs. N (non-interacting). The dataset includes 8795 proteins, with 936 labeled Y (interacting with nucleic acids) and 7859 labeled N (not interacting with nucleic acids).

Development of predictive models

You are required to compute models with version 9.0 of RapidMiner Studio using five different algorithms. Three of these five algorithms must be the Decision Tree, kNN, and Naive Bayes. You can choose any other predictive algorithms for the remaining two. You should parametrize each of these algorithms (select the best possible combination of values of their parameters), to the best of your ability, in order to maximize predictive performance. Note that you will need to read, make an educated guess, and/or use a trial-and-error approach to figure out which parameters make a difference and how to use them. Do not use the "advanced parameters". Do not attempt to sample the dataset, i.e., do not perform feature or sample/object selection.

Evaluation and comparison of predictive models

You must evaluate the predictive performance using accuracy (the percentage of correctly classified instances). For each algorithm you must perform three types of tests:
- on the entire dataset ("use training dataset")
- on 50% of the dataset, using the other 50% to compute the model ("percentage split")
- using 10-fold cross-validation

The 10-fold cross-validation divides the dataset at random into 10 equal-size subsets, where one subset is used to test the model and the remaining nine to compute the prediction model. This is repeated 10 times, each time using a different subset as the test set. Consequently, this test results in predicting every protein in the dataset. This test type is implemented in RapidMiner Studio with the "Cross Validation" operator where the number of folds is set to 10.

1. List and briefly describe the methods that you used (one sentence per method) and list their key parameters.
2. Using the table shown below, report the accuracies for the five algorithms and the three test types. The accuracy values must be reported with two digits after the decimal point, e.g., 91.05. You must include the accuracies of the models that use default parameters and the best selected parameters. In total, you have 5*3*2 = 30 results to report. List the best selected values of parameters for each model and each test type.
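RapidMiner Studio 9.0 is a visual tool, so the assignment itself involves no coding; still, a minimal Python/pandas sketch can confirm what the question states about the dataset. The URL and the "class" column name come from the question; the variable names and the shape check are illustrative assumptions.

import pandas as pd

# URL given in the question; the file is plain comma-separated text.
URL = "http://www.dewetcomputers.com/CSV/dataset_assignment3.csv"
df = pd.read_csv(URL)

# Per the question we expect 8795 proteins, each with 13 numeric
# features plus the symbolic "class" outcome (Y vs. N).
print(df.shape)                    # expected: (8795, 14)
print(df["class"].value_counts())  # expected: N 7859, Y 936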
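The three test types map naturally onto scikit-learn idioms. The sketch below is a conceptual stand-in for the RapidMiner operators, not the required workflow; it assumes the DataFrame df from the previous sketch and uses a Decision Tree only as an example learner.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score

X = df.drop(columns=["class"])
y = df["class"]

clf = DecisionTreeClassifier(random_state=0)

# 1. "Use training dataset": train and test on the entire dataset.
clf.fit(X, y)
acc_train = accuracy_score(y, clf.predict(X))

# 2. "Percentage split": compute the model on 50% of the data and
#    test it on the other 50%.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
clf.fit(X_tr, y_tr)
acc_split = accuracy_score(y_te, clf.predict(X_te))

# 3. 10-fold cross-validation: every protein is predicted exactly once.
acc_cv = cross_val_score(clf, X, y, cv=10).mean()

# Report accuracies as percentages with two digits after the decimal point.
for name, acc in [("training", acc_train), ("50% split", acc_split), ("10-fold CV", acc_cv)]:
    print(f"{name}: {100 * acc:.2f}")

Note that accuracy on the full training set is optimistic (the model has seen every instance), while the 10-fold cross-validation gives a less biased estimate; that contrast is presumably why the assignment asks for all three test types.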
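Finally, the parametrization step ("select the best possible combination of values of their parameters") amounts to a small grid search. As a hedged illustration, the sketch below tunes k for kNN by 10-fold cross-validated accuracy, reusing X and y from the sketches above; the candidate values of k are arbitrary, and in RapidMiner the same trial-and-error would be done by editing operator parameters and rerunning the process.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Arbitrary illustrative grid; not a recommended setting.
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 15, 25]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring="accuracy")
search.fit(X, y)

print("best k:", search.best_params_["n_neighbors"])
print(f"best 10-fold accuracy: {100 * search.best_score_:.2f}")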
