Data Mining Case: Spreadsheet Modeling Analysis

Project Description:

In the wake of the Enron scandal in 2002, two public accounting firms, Oscar Anderson (OA) and Trice-Milkhouse-Loopers (TML), merged (forming OATML) and are reviewing their methods for detecting management fraud during audits. The two firms had each developed their own set of questions that auditors could use in assessing management fraud.
To avoid a repeat of the problems faced by Enron's auditors, OATML wants to develop an automated decision tool to assist auditors in predicting whether or not their clients are engaged in fraudulent management practices. This tool would ask an auditor all of the OA or TML fraud detection questions and then automatically render a decision about whether or not the client company is engaging in fraudulent activities. The decision problem OATML faces is really two-fold: 1) which of the two sets of fraud detection questions is better at detecting fraud? and 2) what is the best way to translate the answers to these questions into a prediction or classification about management fraud?
To assist in answering these questions, the company has compiled an Excel spreadsheet that contains both the OA and TML fraud detection questions and the answers to both sets of questions from 382 audits previously conducted by the two companies (see sheets OA and TML, respectively). (Note: for all data, 1 = yes, 0 = no.) For each audit, the last variable in the spreadsheet indicates whether or not the audited company was engaged in fraudulent activities (77 audits uncovered fraudulent activities, 305 did not). You have been asked to perform the following analysis and provide a recommendation as to what combination of fraud questions OATML should adopt.


1. For the OA fraud questions, create a correlation matrix for all the variables. Do any of the correlations pose a concern? (A pandas sketch of this step appears after question 7.)

2. Using the 8 questions that correlate most strongly with the dependent fraud variable, partition the OA data with oversampling to create training and validation data sets with a 50% success rate in the training data. (See the partitioning sketch after question 7.)

3. Use each of XLMiner's classification techniques to create classifiers for the partitioned OA data set. Summarize the classification accuracy of each technique on the training and validation sets. Interpret these results and indicate which technique you would recommend OATML use. (See the classifier-comparison sketch after question 7.)

4. For the TML fraud questions, create a correlation matrix for all the variables. Do any of the correlations pose a concern?

5. Using the 8 questions that correlate most strongly with the dependent fraud variable, partition the TML data with oversampling to create training and validation data sets with a 50% success rate in the training data.

6. Use each of XLMiner's classification techniques to create classifiers for the partitioned TML data set. Summarize the classification accuracy of each technique on the training and validation sets. Interpret these results and indicate which technique you would recommend OATML use.

7. Suppose OATML wants to use both fraud detection instruments and combine their individual results to create a composite prediction. Let LR1 represent the logistic regression probability estimate for a given company using the OA fraud detection instrument and LR2 represent the same company's logistic regression probability estimate using the TML instrument. The composite score for the company might then be defined as C = w1·LR1 + (1 − w1)·LR2, where 0 ≤ w1 ≤ 1.
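
For questions 1 and 4, a minimal pandas sketch of the correlation check follows. The workbook file name, the sheet name, and the 0.7 flagging threshold are assumptions; adjust them to the actual spreadsheet.

```python
import pandas as pd

# File and sheet names are assumptions -- point these at the actual workbook.
oa = pd.read_excel("fraud_audits.xlsx", sheet_name="OA")

# Correlation matrix for all OA questions plus the fraud indicator.
corr = oa.corr()
print(corr.round(2))

# List question pairs whose correlation is high enough to raise a
# multicollinearity concern (0.7 is an illustrative cutoff).
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        if abs(corr.loc[a, b]) > 0.7:
            print(f"{a} vs {b}: {corr.loc[a, b]:.2f}")
```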
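For questions 2 and 5, here is a rough imitation of XLMiner's "partition with oversampling" option: equal counts of fraud and non-fraud cases give the 50% success rate in the training data. The target column name ("Fraud"), the 60% share of fraud cases sent to training, and the top-8 selection rule are assumptions, and XLMiner's exact mechanics may differ.

```python
import pandas as pd

def oversample_partition(df, target="Fraud", n_top=8, train_frac=0.6, seed=1):
    """Pick the n_top questions most correlated with the target, put
    train_frac of the fraud cases plus an equal number of non-fraud cases in
    the training set (a 50% success rate), and send the rest to validation."""
    ranked = df.corr()[target].drop(target).abs().sort_values(ascending=False)
    cols = list(ranked.index[:n_top]) + [target]
    data = df[cols]

    fraud = data[data[target] == 1].sample(frac=1, random_state=seed)  # shuffle
    clean = data[data[target] == 0].sample(frac=1, random_state=seed)

    n_fraud_train = int(len(fraud) * train_frac)  # e.g. ~46 of the 77 fraud cases
    train = pd.concat([fraud.iloc[:n_fraud_train], clean.iloc[:n_fraud_train]])
    valid = data.drop(index=train.index)
    return train, valid
```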
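For questions 3 and 6, scikit-learn stand-ins for XLMiner's classification techniques (logistic regression, k-nearest neighbors, classification tree, naive Bayes, discriminant analysis, neural network) can be compared on the two partitions. The particular models and settings below are assumptions, not XLMiner's defaults.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

def compare_classifiers(train, valid, target="Fraud"):
    """Fit each classifier on the training partition and report accuracy on
    both partitions, mirroring the summary XLMiner produces."""
    X_tr, y_tr = train.drop(columns=target), train[target]
    X_va, y_va = valid.drop(columns=target), valid[target]
    models = {
        "Logistic regression":   LogisticRegression(max_iter=1000),
        "k-nearest neighbors":   KNeighborsClassifier(n_neighbors=5),
        "Classification tree":   DecisionTreeClassifier(max_depth=4, random_state=1),
        "Naive Bayes":           BernoulliNB(),
        "Discriminant analysis": LinearDiscriminantAnalysis(),
        "Neural network":        MLPClassifier(max_iter=2000, random_state=1),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        print(f"{name:22s}  train {model.score(X_tr, y_tr):.3f}"
              f"  valid {model.score(X_va, y_va):.3f}")

# Example usage (with the partition function from the previous sketch):
# train, valid = oversample_partition(oa)
# compare_classifiers(train, valid)
```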
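For question 7, the composite score C = w1·LR1 + (1 − w1)·LR2 can be computed directly once the two logistic regression probability estimates are available. The grid search over w1 below is only one illustrative way to choose the weight; the case may intend a different criterion (for example, minimizing misclassification cost with Solver), so treat it as an assumption.

```python
import numpy as np

def composite_score(lr1, lr2, w1):
    """Composite fraud probability: C = w1*LR1 + (1 - w1)*LR2, 0 <= w1 <= 1."""
    return w1 * np.asarray(lr1) + (1 - w1) * np.asarray(lr2)

def best_weight(lr1, lr2, actual, cutoff=0.5):
    """Grid-search w1 for the highest classification accuracy at the given
    cutoff (the 0.5 cutoff and the accuracy objective are assumptions)."""
    results = []
    for w1 in np.linspace(0.0, 1.0, 101):
        predicted = composite_score(lr1, lr2, w1) >= cutoff
        accuracy = np.mean(predicted == np.asarray(actual, dtype=bool))
        results.append((accuracy, w1))
    best_accuracy, best_w1 = max(results)
    return best_w1, best_accuracy
```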