Hello again. You know, the trouble with life is that sometimes everything just comes down to money. In this lesson and the next we’re going to look at counting the cost in data mining applications.

What is success? Well, that’s a pretty good question, I suppose. In data mining terms, we’ve looked at the classification rate, measured on a test set or holdout or cross-validation. But, essentially, we’re trying to minimize the number of errors or maximize the classification rate. In real life, different kinds of errors might have different costs, and minimizing the total errors might be inappropriate. Now, we looked at the ROC curve in Class 2, and that shows you the different tradeoffs between the different error costs. But it’s not really appropriate if you actually know the error costs. Then we want to pick a particular point on this ROC curve.

We’re going to look at the credit rating dataset,
credit-g.arff. It’s worse to class a customer as “good” when they’re “bad” than it is to class a customer as “bad” when they’re “good”. In this dataset, the class value is “good” or “bad”. The idea is that if you class someone as “good” when they’re “bad” and you give them a loan, then they’re going to run away with all your money, whereas if you make an error the other way round then you might have an opportunity to rectify it later on. To tell you the truth, I know nothing about the credit rating industry, but let’s just suppose that’s the case. Furthermore, let’s suppose that the cost ratio is 5 to 1.

I’ve got the credit dataset open here, and
I’m going to run J48. What I get is an error rate of 29.5%, a success rate of 70-71%. Down here is the confusion matrix. I’ve copied those over here on to this slide. You can see that the cost here, the number of errors, is effectively the 183 plus 112, those off-diagonal elements of the confusion matrix. If errors cost the same amount, that’s a fair reflection of the cost of this confusion matrix. However, if the cost matrix is different, then we need to do a different kind of evaluation.

On the Classify panel, we can do a cost-sensitive evaluation. Let me go and do that for you. In the More options menu, we’re going to do a cost-sensitive evaluation. I need to set a cost matrix. This interface is a little weird. I need a 2 by 2 matrix; I’m going to resize this. Here we’ve got a cost of 1 for both kinds of error, but I want a cost of 5 for this kind of error. Just close that and then run this again.

Now I’ve got the same result, the same confusion matrix, but I’ve got some more figures here. I’ve got a total cost of 1027 and an average cost of 1.027. (There are 1000 instances in this dataset.) Coming back to the slide, the cost here is computed by taking the 183 in the lower left and multiplying it by 5–because that’s the cost of errors down there–and the 112 times 1, adding those up, and I get 1027.

If I take the baseline, let’s go and have
a look at ZeroR. I’m going to run ZeroR on this. Here it is. Here I get a cost of 1500. I get this confusion matrix. Over here on the slide, there’s the confusion matrix. And although I’ve only got 300 errors here, they’re expensive errors, they each cost $5, so I’ve got a cost of 1500. This is classifying everything as “good”, because there are more “good” instances than “bad” in this dataset. If I were to classify everything as “bad” the total cost would only be 700. That’s actually better than either J48 or ZeroR.

Obviously we ought to be taking the cost matrix into account when we’re doing the classification, and that’s exactly what the CostSensitiveClassifier does. We’re going to take the CostSensitiveClassifier, select J48, define a cost matrix, and see what happens. It’s in the meta>CostSensitiveClassifier,
which is here. I can define a classifier. I’m going to choose J48, which is here. I need to specify my cost matrix. I want it 2 by 2. I’ll need to resize that. I need to put a 5 down here. Cool. I’m just going to run it.

Now I get a worse classification error. We’ve only got 60-61% accuracy, but we’ve got a smaller cost, 658. And we’ve got a different confusion matrix. Back here on the slide you can see that. The old confusion matrix looked like this, and the new confusion matrix is the one on the right. You can see that the number of expensive errors, 183, has been reduced to 66. That brings the cost down, the average cost, to 0.66 per instance instead of 1.027, despite the fact that we now have a worse classification rate.

Let’s look at what ZeroR does with the CostSensitiveClassifier. It’s kind of interesting because we’re going to get a different rule. Instead of classifying everything as “good”, we’re going to classify everything as “bad”. We’re going to make 700 mistakes, but they’re cheap mistakes. It’s only going to cost us $700.

That’s what we’ve learned today. Is classification accuracy the best measure?
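
The cost-sensitive ZeroR result described in this lesson can be checked by hand. As a minimal sketch in Python (the function name `constant_prediction_cost` and the dictionary layout are illustrative assumptions, not anything Weka exposes), the code below takes the class counts stated in the lesson, 700 “good” and 300 “bad”, together with the supposed 5-to-1 costs, and works out what it costs to label every instance the same way. It only checks the arithmetic; it says nothing about how the CostSensitiveClassifier works internally.

```python
# Class counts in credit-g.arff as stated in the lesson.
class_counts = {"good": 700, "bad": 300}

# cost[actual][predicted]; the 5-to-1 ratio is the lesson's supposition.
cost = {
    "good": {"good": 0, "bad": 1},  # calling a good customer "bad" costs 1
    "bad": {"good": 5, "bad": 0},   # calling a bad customer "good" costs 5
}


def constant_prediction_cost(predicted, class_counts, cost):
    """Total cost of labelling every instance as `predicted` (hypothetical helper)."""
    return sum(n * cost[actual][predicted] for actual, n in class_counts.items())


# Cost of each possible constant rule, then the cheapest one.
costs_by_label = {
    label: constant_prediction_cost(label, class_counts, cost)
    for label in class_counts
}
cheapest = min(costs_by_label, key=costs_by_label.get)

print(costs_by_label)  # {'good': 1500, 'bad': 700}
print(cheapest)        # bad
```

With equal costs the majority class “good” would be the cheaper constant rule (300 errors against 700); it is only the factor of 5 that tips the choice to “bad”.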

Very likely it isn’t. In real life, different kinds of errors usually do have different costs. If you don’t know the costs, you just might want to look at the tradeoff between the error costs, different parts of the space; and the ROC curve is appropriate for that. But if you do know the costs–the cost matrix–then you can do cost-sensitive evaluation to find the total cost on the test set of a particular learned model, or you can do cost-sensitive classification, that is, take the costs into account when producing the classifier.

The CostSensitiveClassifier does this: it makes any classifier cost-sensitive. How does it do this? Very good question. We’re going to find out in the next lesson. Off you go now and do the activity, and we’ll see you soon. Bye for now!
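
The cost-sensitive evaluation figures quoted in the lesson can be reproduced from the confusion matrices. The sketch below is a plain Python illustration, not Weka code, and the helper names `total_cost` and `accuracy` are assumptions made for this example. Rows are the actual class and columns the predicted class, both in the order “good”, “bad”, so the 5 sits in the lower left as in the lesson. The lesson quotes only the off-diagonal counts (183 and 112 for J48, 66 for the cost-sensitive J48); the remaining cells here are derived from the 700/300 class totals and the quoted total cost of 658, so treat them as reconstructed rather than read off the slide.

```python
def total_cost(confusion, cost):
    """Sum of count * cost over every (actual, predicted) cell."""
    return sum(
        confusion[a][p] * cost[a][p]
        for a in range(len(confusion))
        for p in range(len(confusion[a]))
    )


def accuracy(confusion):
    """Fraction of instances on the diagonal of the confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# rows = actual (good, bad), columns = predicted (good, bad)
cost = [[0, 1],
        [5, 0]]

models = {
    "J48": [[588, 112], [183, 117]],
    "ZeroR (everything good)": [[700, 0], [300, 0]],
    "everything bad": [[0, 700], [0, 300]],
    # 328 cheap errors derived from 658 - 5 * 66
    "CostSensitiveClassifier with J48": [[372, 328], [66, 234]],
}

for name, cm in models.items():
    n = sum(sum(row) for row in cm)
    c = total_cost(cm, cost)
    print(f"{name}: accuracy={accuracy(cm):.3f} total_cost={c} avg_cost={c / n:.3f}")
```

Read side by side, the four rows show the lesson's point: the model with the best accuracy (plain J48) is not the one with the lowest total cost.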

## 2 Comments

Added! Nice to meet you sir

Given that there are many instances where we may not know the cost of a specific error type, is it possible to create a tensor to show ourselves the varying output with different costs?