Question: provide the details Conclusions 4. MODELING AND EVALUATION Operator Process Document (Figure 2) from data was used during the process of modeling with the aim

provide the details Conclusions
4. MODELING AND EVALUATION Operator Process Document (Figure 2) from data was used during the process of modeling with the aim to execute tokenization. Suboperators used in process are Tokenize, Filter Tokens by Length, Stem and Filter Stopword (English) (Figure 3). Operator Tokenize recognize everything that is not a single letter as a special word (Token) meaning that option "non letter" is on. For filter Tokens operator minimal word length is setup to be 3 characters (because of words like "bad" which can carry strong emotion) and for maximum 25 characters. For word rooting Porters algorithm for "stemming" is used. At the end an operator for removing English stopwords is applied. Process Parameters Context XML M Process Process Documents from Data Marc create word vector vector creation TF OF Madd mata information op tot lect ATE prune method none * - datamant double sparse_art velocibutes and weights Figure 2: Process Documents from Data 39 Process Process Documents from Data Parameters 0 Vector Creation Tokenize Tokenize mode doc non letters Q Filter Tokens Filter Tokens (by Length) min chars 3 Stem (Porter) dae dos max chars 25 O Filter Stopwor. Stem (Porter dac Filter Stopwords (English) Figure 3: Process text suboperators Several models for classification were used: k-NN, W-B1 (Weka classifier) and Naive Bayes (Table 1). It turned out that using operator Naive Bayes takes shorter time for execution, but accuracy gained is lower comparing with two other operators (operator W-IB1 and operator Naive Bayes). Operator W-1B1 achieves precision as K-NN, but W-IB1 operator needs lot more time. So, k-NN operator, based on principle of neuron network, is used in this case. For assessment of optimal number "k" the Optimize Parameters node was used while for measure type "Cosine Similarity" has been used. For model evaluation, operator Validation(X-Validation) has been used. Number of validation is setup to 5 and the sample was stratified. Criteria of goodness was "accuracy" (according to contest propositions). The best result for predicting accuracy using operator K-NN was 0.53 on 6% of sample set (Figure 4). For operator Naive Bayes best accuracy achieved (in best case) is 0.22. Table 1: Comparing result of 3 operators MODEL Time 1:59 K-NN K = 1 K = 2 K = 3 K = 4 Accuracy 49.39% +/- 1.29% 49.46% +/- 1.29% 51.06% +/- 0.57% 50.38% +/- 0.68% accuracy: 49.20% +/- 1.09% 20.37% +/- 0.93% 20.37% +/-0.93% 2:00 1:52 1:37 16:43 0:31 0:32 W-IB1 Naive Bayes Laplace correctionStep by Step Solution
There are 3 Steps involved in it
1 Expert Approved Answer
Step: 1 Unlock
Question Has Been Solved by an Expert!
Get step-by-step solutions from verified subject matter experts
Step: 2 Unlock
Step: 3 Unlock
