Question:

Part 4: Learning Curve with Bootstrapping (8 Points)

In this HW we are trying to find the best linear model to predict whether a record represents the Higgs Boson. One of the drivers of the performance of a model is the sample size of the training set. As a data scientist, sometimes you have to decide whether you have enough data or should invest in more. We can use learning curve analysis to determine if we have reached a performance plateau. This will inform us on whether or not we should invest in more data (in this case, by running more experiments).

Given a training set of size N, we test the performance of a model trained on a subsample of size N_i, where N_i <= N. We can plot how performance grows as we move N_i from 0 to N. Because of the inherent randomness of subsamples of size N_i, we should expect that any single sample of size N_i might not be representative of an algorithm's performance at a given training set size. To quantify this variance and get a better generalization, we will also use bootstrap analysis. In bootstrap analysis, we pull multiple samples of size N_i, build a model, evaluate it on a test set, and then take the average and standard error of the results. An example of using bootstrapping to build a learning curve can be found here: https://g Rynb

1. Create a bootstrap function that can do the following:

def modBootstrapper(train, test, nruns, sampsize, lr, c)

Takes as input:
- A master training file (train)
- A master testing file (test)
- The number of bootstrap iterations (nruns)
- The size of a bootstrap sample (sampsize)
- An indicator variable to specify LR or SVM (lr=1)
- A c option (only applicable to SVM)

Runs a loop with (nruns) iterations, and within each loop:
- Samples (sampsize) instances from train, with replacement
- Fits either an SVM or LR (depending on the options specified)
- Computes AUC on the test data using predictions from the model fit in the step above
- Stores the AUC in a list

Returns the mean(AUC) and standard deviation(AUC) across all bootstrap samples.

Note: the standard error of the mean AUC is really the standard deviation of the bootstrapped distribution, so just use np.sqrt(np.var(...)).
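A minimal sketch of what such a function might look like, assuming train and test are pandas DataFrames whose target column is named 'label' (a hypothetical layout, not stated in the prompt), and using scikit-learn's LogisticRegression and LinearSVC; lr=1 selects logistic regression, anything else the linear SVM:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score

def modBootstrapper(train, test, nruns, sampsize, lr=1, c=1.0):
    # Hypothetical column layout: 'label' is the target, the rest are features.
    X_test, y_test = test.drop(columns=['label']), test['label']
    aucs = []
    for _ in range(nruns):
        # Draw a bootstrap sample of `sampsize` rows, with replacement.
        boot = train.sample(n=sampsize, replace=True)
        X_boot, y_boot = boot.drop(columns=['label']), boot['label']
        if lr == 1:
            model = LogisticRegression(max_iter=1000).fit(X_boot, y_boot)
            # AUC needs a score for the positive class, not a hard label.
            scores = model.predict_proba(X_test)[:, 1]
        else:
            model = LinearSVC(C=c).fit(X_boot, y_boot)
            # LinearSVC has no predict_proba; its decision function
            # provides the ranking score that AUC requires.
            scores = model.decision_function(X_test)
        aucs.append(roc_auc_score(y_test, scores))
    # Per the note above, report the standard deviation of the
    # bootstrap distribution as the standard error of the mean AUC.
    return np.mean(aucs), np.sqrt(np.var(aucs))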
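With that function, the learning curve itself is a sweep over increasing sample sizes N_i, plotting mean AUC with bootstrap error bars. The sizes and nruns below are illustrative choices rather than values from the assignment, and train and test are assumed to be already loaded as DataFrames:

import matplotlib.pyplot as plt

# Illustrative grid of training sizes N_i (assumed, not from the assignment).
sizes = [100, 500, 1000, 5000, 10000, 50000]
means, stds = [], []
for n in sizes:
    m, s = modBootstrapper(train, test, nruns=20, sampsize=n, lr=1, c=1.0)
    means.append(m)
    stds.append(s)

# A curve that flattens out indicates a performance plateau, i.e.
# collecting more data (running more experiments) would buy little.
plt.errorbar(sizes, means, yerr=stds, capsize=3)
plt.xscale('log')
plt.xlabel('Training sample size N_i')
plt.ylabel('Test AUC')
plt.title('Learning curve with bootstrapped error bars')
plt.show()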
