Question:
Allocate space for the data array, making sure that everything is properly initialized. The data will be pulled from a file specified as the input to the constructor. The first line of this file will contain a header with two numbers: the first specifies the number of datapoints in the file, and the second specifies the dimension of each datapoint. The file will thus have a number of rows specified by the first number in the header, and a number of columns specified by the second. Once the data array has been initialized consistently with the header information in the file, the KMC constructor can pull all the data from the file into this array, and store the number and dimension in the numdata and dim fields.
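The header-then-rows loading described above can be sketched as follows. This is only an illustration under stated assumptions: the DataSet struct and the readDataSet and readDataSetFromFile functions are hypothetical names, std::vector stands in for the assignment's own Vector class, and the file is assumed to hold whitespace-separated numeric values.

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the storage inside KMC; the assignment's own
// Vector class is not reproduced here.
struct DataSet {
    int numdata = 0;                        // number of datapoints (rows)
    int dim = 0;                            // dimension of each datapoint (columns)
    std::vector<std::vector<double>> data;  // numdata rows of dim values each
};

// Read the two header numbers, allocate a zero-initialized array of the
// advertised size, then fill it row by row.
DataSet readDataSet(std::istream& in) {
    DataSet ds;
    if (!(in >> ds.numdata >> ds.dim) || ds.numdata <= 0 || ds.dim <= 0)
        throw std::runtime_error("missing or invalid header");

    ds.data.assign(ds.numdata, std::vector<double>(ds.dim, 0.0));
    for (int i = 0; i < ds.numdata; ++i)
        for (int j = 0; j < ds.dim; ++j)
            if (!(in >> ds.data[i][j]))
                throw std::runtime_error("file is shorter than the header claims");

    // Sanity check: the array shape matches the header.
    assert(static_cast<int>(ds.data.size()) == ds.numdata);
    return ds;
}

// Convenience wrapper that takes a file name, as the constructor would.
DataSet readDataSetFromFile(const std::string& filename) {
    std::ifstream in(filename);
    if (!in) throw std::runtime_error("cannot open " + filename);
    return readDataSet(in);
}
```

Reading from a std::istream rather than directly from a file keeps the parsing logic testable with an in-memory std::istringstream.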
The train function will receive the three inputs discussed above, specifying how the training is to be accomplished. The first input is the number of training cycles to use, and the third is the number of centers to use in the clustering/classification analysis. Note that train will have to allocate and properly initialize the c and nc fields in the KMC class consistently with the specified number of centers and the dimension of the datapoints. For the starting values of the centers, use the first nc points in the data array. Finally, the second input to train is a float specifying the fraction of the data in data that should be used in the training; the rest will be reserved to test the accuracy of the learning when it has been completed. If we call this second input frac, then only the first frac*numdata points in data should be used for the training.
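A rough sketch of how such a training routine might look is given below. It assumes the standard batch k-means update (assign each training point to its nearest center, then move each center to the mean of its assigned points), which may differ in detail from the update rule the assignment specifies. The names Point, dist2, and trainCenters are hypothetical, and std::vector again stands in for the provided Vector class.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Point = std::vector<double>;  // hypothetical stand-in for Vector

// Squared Euclidean distance between two points of equal dimension.
double dist2(const Point& a, const Point& b) {
    assert(a.size() == b.size());
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        s += d * d;
    }
    return s;
}

// Same argument order as described for train: cycles, fraction, centers.
// Assumption: standard batch k-means update on the first frac*numdata points.
std::vector<Point> trainCenters(const std::vector<Point>& data,
                                int cycles, float frac, int nc) {
    // Only the first frac*numdata points take part in training (truncated).
    const std::size_t ntrain = static_cast<std::size_t>(frac * data.size());
    assert(nc > 0 && static_cast<std::size_t>(nc) <= ntrain);
    const std::size_t dim = data[0].size();

    // Starting centers are the first nc datapoints.
    std::vector<Point> c(data.begin(), data.begin() + nc);

    for (int cycle = 0; cycle < cycles; ++cycle) {
        std::vector<Point> sum(nc, Point(dim, 0.0));
        std::vector<int> count(nc, 0);

        for (std::size_t p = 0; p < ntrain; ++p) {
            // Find the nearest center for this training point.
            int best = 0;
            double bestDist = dist2(data[p], c[0]);
            for (int k = 1; k < nc; ++k) {
                const double d = dist2(data[p], c[k]);
                if (d < bestDist) { best = k; bestDist = d; }
            }
            for (std::size_t j = 0; j < dim; ++j) sum[best][j] += data[p][j];
            ++count[best];
        }

        // Move each center to the mean of its points; empty clusters stay put.
        for (int k = 0; k < nc; ++k)
            if (count[k] > 0)
                for (std::size_t j = 0; j < dim; ++j)
                    c[k][j] = sum[k][j] / count[k];
    }
    return c;
}
```

Leaving a center unchanged when no points are assigned to it is one possible convention; the assignment may expect a different one.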
The classify function identifies which center a specified datapoint is closest to using the vector norm measurement. The output should be the index in the c array of the closest center to the input vector. The centers function is just a simple accessor that returns the c array as output.
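The nearest-center lookup can be illustrated with a short sketch. It assumes the Euclidean norm of the difference between the point and each center, and uses hypothetical names (Point, euclideanNorm, closestCenter) with std::vector in place of the provided Vector class and its overloaded operators.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Point = std::vector<double>;  // hypothetical stand-in for Vector

// Euclidean norm of a vector.
double euclideanNorm(const Point& v) {
    double s = 0.0;
    for (double x : v) s += x * x;
    return std::sqrt(s);
}

// Return the index in c of the center closest to x.
// Ties go to the lowest index because the comparison is strict.
int closestCenter(const std::vector<Point>& c, const Point& x) {
    assert(!c.empty());
    int best = 0;
    double bestDist = 0.0;
    for (std::size_t k = 0; k < c.size(); ++k) {
        assert(c[k].size() == x.size());
        Point diff(x.size());
        for (std::size_t i = 0; i < x.size(); ++i) diff[i] = x[i] - c[k][i];
        const double d = euclideanNorm(diff);
        if (k == 0 || d < bestDist) {
            best = static_cast<int>(k);
            bestDist = d;
        }
    }
    return best;
}
```

For example, with centers (0, 0) and (5, 5), the point (4, 4) is assigned index 1 and the point (1, 0) is assigned index 0.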
When all this code is written properly, with the iris.dat file providing the data, the
following output should appear in the file results.dat.
c:
c:
c:
Finally, we consider how to score the trained model's performance at classifying data. This is accomplished by the final member function in KMC, called score, which outputs its results in a learningScore structure. The details of how this scoring is accomplished are provided in the following section.
Note: The provided kmc.cpp file already contains all the code needed for the Vector class and its overloaded operator. All you have to do is write the code for the new KMC class to implement the training process described, using the vector arithmetic enabled by the Vector class. Your solution is expected to use these features of the Vector class to accomplish its calculations.
Note: Your submission will be graded first using the provided iris.dat file, then with a different data file. The number of data points, the dimension of each data point, and the number of clusters/types will be different in the new dataset.
Scoring
There are several metrics used to evaluate the quality of the model developed by a machine learning algorithm, including a precision score $P$, a recall score $R$, and a composite metric called $F_1$, which is defined in terms of $P$ and $R$ as

$$F_1 = \frac{2PR}{P + R}$$

The $P$ and $R$ metrics in turn are calculated by comparing the predictions the trained model makes with the assumed known type that each data point actually corresponds to. In particular, we need to count the number of true positive (tp) results, where the model correctly predicts the type represented by a data point; the number of false negative (fn) results, where the model incorrectly predicts a different type from that represented by the data; and the number of false positive (fp) results, which are counted against the type the model actually predicted when its prediction is incorrect. In terms of tp, fn, and fp, the $P$ and $R$ scores for each type are computed as

$$P = \frac{\mathrm{tp}}{\mathrm{tp} + \mathrm{fp}}, \qquad R = \frac{\mathrm{tp}}{\mathrm{tp} + \mathrm{fn}}$$

There is a separate $F_1$ score for each type in the dataset, and it is common to average the scores for the different types to produce a composite, overall score.
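As a small worked illustration, the per-type metrics can be computed from the three counts as sketched below. This assumes the standard definitions (precision = tp/(tp+fp), recall = tp/(tp+fn), and the composite as the harmonic mean of the two). The Metrics struct and computeMetrics function are hypothetical names, not the assignment's learningScore structure, and returning 0 when a denominator is zero is an assumed convention.

```cpp
#include <cassert>

// Hypothetical container for one type's scores (not the learningScore struct).
struct Metrics {
    double precision;
    double recall;
    double f1;
};

// Compute precision, recall, and their harmonic mean from the counts.
// Assumption: a score with a zero denominator is reported as 0.
Metrics computeMetrics(int tp, int fn, int fp) {
    assert(tp >= 0 && fn >= 0 && fp >= 0);
    Metrics m{0.0, 0.0, 0.0};
    if (tp + fp > 0) m.precision = static_cast<double>(tp) / (tp + fp);
    if (tp + fn > 0) m.recall = static_cast<double>(tp) / (tp + fn);
    if (m.precision + m.recall > 0.0)
        m.f1 = 2.0 * m.precision * m.recall / (m.precision + m.recall);
    return m;
}
```

For instance, tp = 1, fn = 1, fp = 0 gives a precision of 1, a recall of 0.5, and a composite score of 2(1)(0.5)/(1.5), or about 0.667.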
The score function for the KMC class must compute these metrics and store them in
a learningScore structure, which is the output from the function. The identification of
the tp/fn/fp scores for each type can be accomplished as follows:
For each point in the dataset, determine how the model classifies it using the classify function. Suppose that the model classifies the point as type j, while the actual classification of the point is type k.
If j and k match, increase the tp score for type k by one. Otherwise, increase the fn score for type k by one, and increase the fp score for type j by one.
Once every data point has been processed as described above, you will have the tp/fn/fp scores for each type, and can compute the corresponding
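The counting procedure in the steps above can be sketched as follows. The sketch assumes the predicted type j and actual type k of each point are already available as two integer arrays; how the actual types are obtained is not specified in this section. The Counts struct and tallyCounts function are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-type tallies, each indexed by type number.
struct Counts {
    std::vector<int> tp;
    std::vector<int> fn;
    std::vector<int> fp;
};

// predicted[i] is the type j chosen by the model for point i;
// actual[i] is the known type k of point i.
Counts tallyCounts(const std::vector<int>& predicted,
                   const std::vector<int>& actual, int ntypes) {
    assert(predicted.size() == actual.size());
    Counts s{std::vector<int>(ntypes, 0),
             std::vector<int>(ntypes, 0),
             std::vector<int>(ntypes, 0)};

    for (std::size_t i = 0; i < predicted.size(); ++i) {
        const int j = predicted[i];
        const int k = actual[i];
        assert(j >= 0 && j < ntypes && k >= 0 && k < ntypes);
        if (j == k) {
            ++s.tp[k];  // correct prediction for type k
        } else {
            ++s.fn[k];  // type k was missed
            ++s.fp[j];  // type j was wrongly predicted
        }
    }
    return s;
}
```

With predictions {0, 1, 1, 2} and actual types {0, 1, 2, 2}, the third point is a false negative for type 2 and a false positive for type 1, while the other three points are true positives.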
