# More Data Mining with Weka (4.5: Counting the cost)

Hello again. You know, the trouble with life is that sometimes
everything just comes down to money. In this lesson and the next we’re going to
look at counting the cost in data mining applications. What is success? Well, that’s a pretty good
question, I suppose. In data mining terms, we’ve looked at the
classification rate, measured on a test set or holdout or cross-validation. But, essentially, we’re trying to minimize
the number of errors or maximize the classification rate. In real life, different kinds of errors might
have different costs, and minimizing the total errors might be inappropriate. Now, we looked at the ROC curve in Class 2,
and that shows you the different tradeoffs between the different error costs. But it’s not really what you need if you actually
know the error costs. In that case we want to pick a particular point on
the ROC curve.

We’re going to look at the credit rating dataset,
credit-g.arff. It’s worse to class a customer as “good” when
they’re “bad” than it is to class a customer as “bad” when they’re “good”. In this dataset, the class value is “good”
or “bad”. The idea is that if you class someone as “good”
when they’re “bad” and you give them a loan, then they’re going to run away with all your
money, whereas if you make an error the other way round then you might have an opportunity
to rectify it later on. To tell you the truth, I know nothing about
the credit rating industry, but let’s just suppose that’s the case. Furthermore, let’s suppose that the cost ratio
is 5 to 1.

I’ve got the credit dataset open here, and
I’m going to run J48. What I get is an error rate of 29.5%, a success
rate of 70.5%. Down here is the confusion matrix. I’ve copied those figures over here on to this slide. You can see that the cost here, the number
of errors, is effectively the 183 plus 112, those off-diagonal elements of the confusion
matrix. If errors cost the same amount, that’s a fair
reflection of the cost of this confusion matrix. However, if the cost matrix is different,
then we need to do a different kind of evaluation.

On the Classify panel, we can do a cost-sensitive
evaluation. Let me go and do that for you. In the More options menu, we’re going to
select cost-sensitive evaluation. I need to set a cost matrix. This interface is a little weird. I need a 2 by 2 matrix; I’m going to resize
this. Here we’ve got a cost of 1 for both kinds
of error, but I want a cost of 5 for this kind of error, classing a “bad” customer as “good”. Just close that and then run this again. Now I’ve got the same result, the same confusion
matrix, but I’ve got some more figures here. I’ve got a total cost of 1027 and an average
cost of 1.027. (There are 1000 instances in this dataset.) Coming back to the slide, the cost here is
computed by taking the 183 in the lower left and multiplying it by 5–because that’s the
cost of errors down there–and the 112 times 1, adding those up, and I get 1027.

For the baseline, let’s go and have
a look at ZeroR. I’m going to run ZeroR on this. Here it is. Here I get a cost of 1500. I get this confusion matrix. Over here on the slide, there’s the confusion
matrix. And although I’ve only got 300 errors here,
they’re expensive errors, they each cost \$5, so I’ve got a cost of 1500. This is classifying everything as “good”,
because there are more “good” instances than “bad” in this dataset. If I were to classify everything as “bad”
the total cost would only be 700. That’s actually better than either J48 or
ZeroR.

Obviously we ought to be taking the cost matrix
into account when we’re doing the classification, and that’s exactly what the CostSensitiveClassifier
does. We’re going to take the CostSensitiveClassifier,
select J48, define a cost matrix, and see what happens. It’s under meta > CostSensitiveClassifier,
which is here. I can define a classifier. I’m going to choose J48, which is here. I need to specify my cost matrix. I want it 2 by 2. I’ll need to resize that. I need to put a 5 down here. Cool. I’m just going to run it. Now I get a worse classification error. We’ve only got 60-61% accuracy, but we’ve
got a smaller cost, 658. And we’ve got a different confusion matrix. Back here on the slide you can see that. The old confusion matrix looked like this,
and the new confusion matrix is the one on the right. You can see that the number 183 of expensive
errors has been reduced to 66. That brings the cost down, the average cost,
to 0.66 per instance instead of 1.027, despite the fact that we now have a worse classification
rate.

Let’s look at what ZeroR does with the CostSensitiveClassifier. It’s kind of interesting because we’re going
to get a different rule. Instead of classifying everything as “good”,
we’re going to classify everything as “bad”. We’re going to make 700 mistakes, but they’re
cheap mistakes. It’s only going to cost us \$700.

That’s what we’ve learned today. Is classification accuracy the best measure?
Very likely it isn’t. In real life, different kinds of errors usually
do have different costs. If you don’t know the costs, you might just
want to look at the tradeoff between the error costs in different parts of the space, and the
ROC curve is appropriate for that. But if you do know the costs–the cost matrix–then
you can do cost-sensitive evaluation to find the total cost on the test set of a particular
learned model, or you can do cost-sensitive classification, that is, take the costs into
account when producing the classifier. The CostSensitiveClassifier does this: it
makes any classifier cost-sensitive. How does it do this? Very good question. We’re going to find out in the next lesson. Off you go now and do the activity, and we’ll
see you soon. Bye for now!
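
As a recap of the arithmetic in this lesson, here is a minimal Python sketch (not Weka code) of a cost-sensitive evaluation: it multiplies each cell of a confusion matrix by the matching cell of the cost matrix and adds everything up. The error counts 183, 112 and 66 and the total costs come from the lesson. The remaining cells are derived, assuming 700 “good” and 300 “bad” instances as the ZeroR results imply, and the 328 cheap errors for the cost-sensitive J48 run are back-calculated from the reported total cost of 658. The name `total_cost` and the layout (rows are actual classes, columns are predicted classes, in the order good, bad) are choices made for this illustration.

```python
# Cost matrix: rows = actual class, columns = predicted class, order (good, bad).
# Classing a "bad" customer as "good" costs 5; the opposite error costs 1.
COST = [[0, 1],
        [5, 0]]

# Confusion matrices in the same layout. Off-diagonal counts for J48 are from
# the lesson; the diagonal cells are derived from 700 "good" / 300 "bad".
J48 = [[588, 112],
       [183, 117]]
ZEROR_ALL_GOOD = [[700, 0],
                  [300, 0]]
ALL_BAD = [[0, 700],
           [0, 300]]
# 66 expensive errors is from the lesson; 328 is back-calculated from cost 658.
COST_SENSITIVE_J48 = [[372, 328],
                      [66, 234]]


def total_cost(confusion, cost=COST):
    """Sum of count * cost over every cell of the confusion matrix."""
    return sum(
        confusion[i][j] * cost[i][j]
        for i in range(len(confusion))
        for j in range(len(confusion[i]))
    )


for name, cm in [
    ("J48", J48),
    ("ZeroR (everything good)", ZEROR_ALL_GOOD),
    ("Everything bad", ALL_BAD),
    ("CostSensitiveClassifier + J48", COST_SENSITIVE_J48),
]:
    n = sum(map(sum, cm))  # number of test instances
    cost = total_cost(cm)
    print(f"{name}: total cost {cost}, average cost {cost / n:.3f}")
```

Running this reproduces the totals quoted above: 1027 for J48, 1500 for ZeroR, 700 for classifying everything as “bad”, and 658 for the cost-sensitive J48.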
