Performance metrics to evaluate a machine learning model

Not all model scoring methods are created equal

Mohit Garg
13 min read · Feb 1, 2019

In machine learning we have various metrics to evaluate the performance of a model, for example accuracy, confusion matrix, precision, recall, F1-score, log-loss, AUC, error distribution and MAD (median absolute deviation). Some of them are meant for classification and some for regression tasks. Let's stick to classification metrics in this part; metrics related to regression will be covered in another part of this article.

It is very important to select the right performance metric. The choice mainly depends on how the datapoints are distributed in the dataset, and related industry experience also plays a vital role. We will get a much clearer picture as we proceed through this article.

Throughout this article I will be using a simple machine learning classification problem as an example to explain most of the performance metrics.

Credit card transactions — here the task is to find whether a transaction is fraudulent or not. '0' or 'negative' means the transaction is genuine; '1' or 'positive' means the transaction is fraudulent.

Now let's discuss the performance metrics mentioned above one by one.

Accuracy

Accuracy can be defined as the ratio of the number of correctly classified points to the total number of datapoints in the test dataset.

Fig 1 — Accuracy formula
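As a quick illustration (a minimal sketch; the labels below are made up for the credit card example), accuracy can be computed with scikit-learn's accuracy_score:

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels for the credit card example: 0 = genuine, 1 = fraud
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

# Accuracy = correctly classified points / total points
print(accuracy_score(y_true, y_pred))  # 6 correct out of 8 -> 0.75
```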

Accuracy is the most fundamental way to evaluate a model, but it does not work in many cases and often misleads us about a model's performance.

Let's understand, step by step, when accuracy is useful and when it is not.

When accuracy is useful

Accuracy is a good performance measure of a model when the dataset is balanced or nearly balanced. For example, consider the IRIS dataset, in which each of the three categories of iris, namely setosa, versicolor and virginica, has 50 datapoints. Hence, if a model gives an accuracy of 95%, we can conclude that the model is capable of correctly classifying all three categories, i.e. the model is unbiased.

When accuracy is not useful

Case I: Imbalanced dataset

In real-life scenarios, we often encounter the problem of an imbalanced dataset, where one class dominates the others, i.e. datapoints of a particular class are present in the majority and the other classes are in the minority.

Consider an example:

Fig 2 — Model comparison for imbalanced data on the basis of accuracy

As shown in the diagram, consider a model M1 which, no matter what datapoint is given as input, returns the class label 'red'. This model is nothing more than a dumb model, yet it achieves an accuracy of 90% on the above data.

Hence, despite getting very good accuracy, we know our model has learned nothing, or is simply a useless model.

On the other hand, M2, which gives an accuracy of 80%, is logically a better model than M1. But on the basis of accuracy as the performance metric, one would prefer model M1 over model M2.
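A minimal sketch of this situation, assuming 90 'red' and 10 'blue' points (the exact counts are illustrative, chosen to match the 90% figure above):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced data: 90 'red' (0) points and 10 'blue' (1) points
y_true = np.array([0] * 90 + [1] * 10)

# M1: a dumb model that always predicts 'red' (0)
m1_pred = np.zeros(100, dtype=int)
print(accuracy_score(y_true, m1_pred))  # 0.90 despite learning nothing

# M2: misses 20 of the 'red' points but correctly finds every 'blue' point
m2_pred = y_true.copy()
m2_pred[:20] = 1        # 20 red points wrongly flagged as blue
print(accuracy_score(y_true, m2_pred))  # 0.80, yet M2 actually detects the minority class
```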

Case II: Picking the right model among models with similar accuracy

Fig 3 — Comparison between a low-confidence and a high-confidence model

Now, Model A and Model B look similar on the basis of accuracy, as both give an accuracy of 95%, so one might pick either model. But is it really so? The answer is "NO".

Both Model A and Model B behave similarly, but Model B is more confident in its decisions than Model A, hence there is less possibility of Model B failing on future points compared to Model A.

Therefore, how confident a model is in making predictions cannot be judged by looking at its accuracy alone.

Confusion Matrix

As the name suggests, a confusion matrix gives the model's performance class-wise and is very helpful for checking whether a model is biased towards a particular class. In other words, each cell of the confusion matrix contains the number of correctly classified or misclassified points for each class, arranged in an organized manner.

The confusion matrix is mainly used for binary classification tasks but can easily be extended to multi-class classification, as shown in the figure below.

Fig 4 — Confusion matrix: (a) for binary classification (b) for multi-class classification

As shown in the figure above, we encounter some new terms, and it is better to explain each of them in depth to avoid any confusion.

  1. True Negatives (TN): Points for which both the actual class label and the predicted class label are 0 (False) come under true negatives.

For example: as per our toy example mentioned above, all transactions which are actually genuine and which the model also predicts as genuine come under true negatives.

  2. True Positives (TP): Points for which both the actual class label and the predicted class label are 1 (True) come under true positives.

For example: all transactions which are actually fraudulent and which the model also predicts as fraudulent come under true positives.

  3. False Positives (FP): Points for which the actual class label is 0 (False) but the model predicts 1 (True) come under false positives. "False" because the model's prediction is incorrect, and "positive" because the predicted class is the positive one.

Ex: all transactions which are genuine but which the model predicts as fraudulent come under false positives.

  4. False Negatives (FN): Points for which the actual class label is 1 (True) but the model predicts 0 (False) come under false negatives. "False" because the model's prediction is incorrect, and "negative" because the predicted class is the negative one.

Ex: all transactions which are fraudulent but which the model predicts as genuine come under false negatives.
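A small sketch of how these four counts can be read off a confusion matrix with scikit-learn (the transaction labels here are made up):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical transactions: 0 = genuine, 1 = fraud
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 1, 0, 0, 1]

# For binary labels, sklearn orders the matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 4 1 1 2
```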

Now comes the fun part: we can optimize our model as per the domain, which is not possible using accuracy alone. Let's understand how.

Case I: Minimizing False Positives

Consider the spam email detection example. Since our task is to detect spam email, all spam emails are labelled as 1 and important emails as 0. We encounter two types of error in this problem, and to optimize the model we have to reduce these errors; but which error to reduce depends on the problem and its requirements. Let's see how.

Types of error:

  1. The email is spam but the model detects it as important — in this type of error the model predicts an email as important, but actually the email is spam.
  2. The email is important but the model detects it as spam — in this type of error the model predicts an email as spam, but actually the email is important.

A spam email classified as important is a problem, but not as big a problem as an important email classified as spam, since that email might contain some very important information which we never get to see just because our machine classified it as spam. Hence our primary goal is to reduce the error where the model detects an email as spam but it is actually an important one, which means minimizing false positives.

Case II: Minimizing False Negatives

In the credit card fraud detection example, we can classify the errors made while predicting as follows:

  1. Transactions which are fraudulent but the model detects as genuine — in this type of error, a credit card transaction is fraudulent but our model detects it as a normal transaction.
  2. Transactions which are genuine but the model detects as fraud — in this type of error, a credit card transaction is a normal transaction but our model detects it as fraudulent.

It is easy to see that our primary goal must be to minimize the error in which transactions are fraudulent but the model detects them as genuine, as this can result in a huge loss to the credit card holder, and the credit card company must be warned before the transaction completes. Minimizing such errors is termed minimizing false negatives.
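One common way to trade false positives for false negatives is to lower the decision threshold on the model's probability scores. The sketch below uses invented probabilities purely to show the effect; it is not data from the article:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical fraud probabilities produced by some model
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0])
proba  = np.array([0.1, 0.2, 0.45, 0.3, 0.8, 0.4, 0.35, 0.05])

for threshold in (0.5, 0.3):
    y_pred = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"threshold={threshold}: FN={fn}, FP={fp}")

# Lowering the threshold from 0.5 to 0.3 catches more fraud (fewer FN)
# at the cost of flagging more genuine transactions (more FP).
```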

When the confusion matrix is useful:

  1. It can be used to calculate the accuracy of the model.
  2. It lets us check whether the model is biased towards a particular class.
  3. It also lets us compare different models in a much better way, using TP, FP, FN, TN, etc.

When the confusion matrix is not useful:

  1. It is unable to distinguish between models which have similar confusion matrices but behave differently on future points. (This is explained in depth under log-loss, later in this article.)

There are some terms related to the confusion matrix; let's have a look at them.

True Positive Rate (TPR): the ratio of the number of true positive datapoints to the total number of positive datapoints.

True Negative Rate (TNR): the ratio of the number of true negative datapoints to the total number of negative datapoints.

False Positive Rate (FPR): the ratio of the number of false positive datapoints to the total number of negative datapoints.

False Negative Rate (FNR): the ratio of the number of false negative datapoints to the total number of positive datapoints.

To define an optimal model: a model having high TPR and TNR and low FPR and FNR is considered a good model.
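Putting the four rates together, a minimal sketch computed from the same illustrative confusion-matrix counts as the earlier sketch:

```python
# Rates derived from confusion-matrix counts (tn, fp, fn, tp)
tn, fp, fn, tp = 4, 1, 1, 2   # illustrative counts from the sketch above

tpr = tp / (tp + fn)   # true positive rate  = TP / all actual positives
tnr = tn / (tn + fp)   # true negative rate  = TN / all actual negatives
fpr = fp / (fp + tn)   # false positive rate = FP / all actual negatives
fnr = fn / (fn + tp)   # false negative rate = FN / all actual positives
print(tpr, tnr, fpr, fnr)  # ~0.667, 0.8, 0.2, ~0.333
```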

Precision

It can be defined as: "Of all the points which the model predicted as positive, the number that are actually positive divided by the total number of points predicted as positive", i.e. TP / (TP + FP).

Recall

It can be defined as: "Of all the points which are actually positive, the number that the model predicted as positive divided by the total number of positive points", i.e. TP / (TP + FN).

Fig 5 — Visual representation of accuracy, precision and recall
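A short sketch of both definitions using scikit-learn, on the same made-up transaction labels as before:

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical transactions: 0 = genuine, 1 = fraud
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 1, 0, 0, 1]

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 2 / 3
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 2 / 3
```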

F1-Score

Instead of using both precision and recall to find the optimal model, it is better to have a single score which represents both precision and recall.

This score is termed the F1-score and is computed as the harmonic mean of precision and recall. But the question arises: why the harmonic mean, and not the arithmetic or geometric mean? Let's understand this before drawing any conclusion.

We can select any of the three types of mean, namely AM, GM and HM, and the discussion below is basically about making the right choice among them.

Fig 6 — Visual representation of why the H.M. is suitable as the F1-score

We know that AM ≥ GM ≥ HM:

Case I: Precision and recall are both high and P ≈ R.

In this case we can choose any of the means as the F1-score, as it will not make much difference, since AM ≈ GM ≈ HM, as shown in the figure.

Case II: Precision and recall are both low and P ≈ R.

In this case too we can choose any of the means as the F1-score, as it will not make much difference, since AM ≈ GM ≈ HM, as shown in the figure.

Case III: Precision >> Recall, or vice versa

This case is important and is the deciding factor for why the H.M. is used in the F1-score formula. Here the A.M. will lie midway between P and R, the G.M. will lie a little closer to the smaller of the two (P or R), but the H.M. will lie closest to the smaller value.

We want the F1-score to be such that if either precision or recall is very low, the F1-score reflects that too; therefore the H.M. of precision and recall is the best choice for the F1-score.

Summing up everything:

The F1-score can be defined as the harmonic mean of precision and recall, where the F1-score reaches its best value at 1 (perfect precision and recall) and its worst at 0. (Definition source: Wikipedia.)

Formula for the F1-score
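A tiny numerical sketch of the three means (the precision and recall values are arbitrary, and the helper functions am/gm/hm are just for illustration) shows why the harmonic mean is the one that exposes the weaker of the two:

```python
import math

def am(p, r): return (p + r) / 2            # arithmetic mean
def gm(p, r): return math.sqrt(p * r)        # geometric mean
def hm(p, r): return 2 * p * r / (p + r)     # harmonic mean = F1-score

# Case III: precision much larger than recall
p, r = 0.9, 0.1
print(am(p, r), gm(p, r), round(hm(p, r), 2))  # 0.5  0.3  0.18
# Only the harmonic mean stays close to the weaker of the two values,
# so a low recall (or precision) cannot hide behind a high partner.
```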

Before moving further, let's understand the term "probability score", which is used frequently in the coming metrics.

Most models classify datapoints using a probability-score approach. Let the class labels be 0 and 1; when the model makes a prediction, if the predicted probability of a point is > 0.5 it is classified as 1, and if the probability is < 0.5 it is classified as 0. Here 0 and 1 are the class labels, but the raw probability on the basis of which the model classifies points is called the probability score.
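As a sketch of what a probability score looks like in practice (the toy dataset and logistic regression below are just placeholders for "some classifier"):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the credit card transactions
X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression().fit(X, y)

# Probability score: P(class = 1) for each point, before any thresholding
scores = clf.predict_proba(X)[:, 1]

# The usual rule: score > 0.5 -> class 1, otherwise class 0
labels = (scores > 0.5).astype(int)
print(scores[:3], labels[:3])
```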

Log-Loss

Log-loss is one of data scientists' favourite performance metrics. It is used for classification problems and is based on probability scores.

Log-loss can be defined as the average negative log of the probability assigned to the correct class label. (Hard to grasp now, but not after going through the discussion below.)

Fig 7 — Model A and Model B with similar accuracy but different confidence in decision making

Consider the two models above (A and B). The models work on the probability approach, i.e. if 0 < prob(X) < 0.5 the point belongs to class red (or '0'), and if 0.5 < prob(X) < 1 the point belongs to class green (or '1').

Both model A and B have same accuracy, will form similar confusion matrix, all other metrics such as precision, recall , F1-score etc will be same hence no above metric is able to distinguish between Model 1 & Model 2. Hence to pick more suitable model log-loss can be used, it can compare two models with exactly similar results.

In Model A, the model correctly classifies most of the points, but most of them lie close to the decision boundary, which means the model is not very confident in its decisions: red ('0') labelled points have probability only slightly less than 0.5, and green ('1') labelled points have probability only slightly greater than 0.5.

On the other hand, Model B gives exactly the same accuracy, but it is much more confident in its decisions, as most of the points are far away from the decision surface: red ('0') labelled points have probability very close to 0, and green ('1') labelled points have probability very close to 1.

Therefore we can conclude that even though Models A and B have the same values for metrics such as accuracy, precision, recall, F1-score, etc., Model B is far better than Model A.

Computing log loss

Fig 8 — table to compute log-loss and compare different models

Consider the table above. Since only the prediction for x4 is wrong in both models, and both models predicted everything else the same, we conclude that the accuracy for both is 83.33%. As per the above data, the log-loss for Model B is much less than that for Model A (remember, log-loss is a loss, hence the lower its value, the better the model).

We can calculate its value using the formula below, which is self-explanatory.

Fig 9 — log-loss formulae for (a) binary (b) multi-class classification
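The same comparison can be reproduced with scikit-learn's log_loss; the probabilities below are invented so that both models make identical predictions and get the same accuracy on six points (they are not the values from the table above):

```python
from sklearn.metrics import accuracy_score, log_loss

y_true = [0, 0, 0, 1, 1, 1]

# Model A: correct on 5 of 6 points, but barely confident
proba_a = [0.45, 0.48, 0.49, 0.51, 0.52, 0.45]
# Model B: the same predictions (hence the same accuracy), but confident
proba_b = [0.05, 0.10, 0.08, 0.95, 0.92, 0.20]

for name, proba in [("A", proba_a), ("B", proba_b)]:
    pred = [1 if p > 0.5 else 0 for p in proba]
    print(name, accuracy_score(y_true, pred), round(log_loss(y_true, proba), 3))

# Both models reach ~0.83 accuracy, but B has a much lower (better) log-loss.
```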

When log-loss is useful:

  1. Comparing models with similar values of performance metrics such as accuracy, precision, recall, F1-score, etc.

When log-loss is not useful:

Although log-loss is very useful, the main reason data scientists do not use it more is that it seems less interpretable: its value starts from 0 and has no upper bound. In that situation it is hard to understand how well our model performs. With accuracy, we know that if we get an accuracy in the 90s our model works fine; with log-loss it is not that simple. For example, if someone says "my model has a log-loss of 0.25", how do we interpret whether this result is good or not?

Well, this can also be taken care of with a small effort: we can form a base model which serves as a benchmark against which to judge a model's log-loss score. (This is only a hack and not a standard approach, but believe me, it is much better than many standard approaches.)

Setting a benchmark

Create a random model R which randomly labels datapoints. Since it is a random model, its result is the worst performance we expect to get. Compute the log-loss for model R; say it comes out to 5.3. This is now the worst result we expect.

Theoretically, we can treat a log-loss of 5.3 as equivalent to 0% accuracy and a log-loss of 0 as 100% accuracy.

Now, just as the accuracy of a model varies between 0 and 100, the log-loss of our models varies between 5.3 and 0.

Case I: If our model's log-loss is > 5.3 (a very rare case)

We can conclude that our model is the dumbest of models, working even worse than the random model.

Case II: If our model's log-loss is a little less than 5.3

We can conclude that our model is just okay and that we still have a lot to improve.

Case III: If our model's log-loss is << 5.3

We can conclude that our model works phenomenally well and that we have built a good model.
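One simple way to set such a benchmark (a hack, as noted above; the exact baseline number depends on how the random model assigns its probabilities) is to score a model that outputs random probability scores:

```python
import numpy as np
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)          # hypothetical test labels

# Random baseline: assign each point a random probability of being class 1
random_scores = rng.uniform(0, 1, size=1000)
baseline = log_loss(y_true, random_scores)
print(round(baseline, 3))   # roughly 1.0 for a uniform random scorer

# Any model worth keeping should land well below this baseline;
# a log-loss close to (or above) it means the model is no better than guessing.
```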

ROC-AUC

The ROC-AUC curve is another performance measurement metric for binary classification models, and it can be extended to multi-class classification.

Talking about the intuition behind the AUC-ROC curve: it basically tells how well a model is able to separate two distinct class labels. AUC varies between 0 and 1; the higher the value, the better the model is able to separate the two classes.

The ROC (Receiver Operating Characteristic) curve is a plot of TPR against FPR, with TPR on the y-axis and FPR on the x-axis, and the area under this curve is called the AUC (Area Under the Curve).

How the AUC-ROC curve is formed

Fig 10 — Table showing how the AUC-ROC curve is formed

Consider the table above. To form the AUC-ROC curve there is a small algorithm; let's understand it step by step:

  1. For every datapoint, compute its probability score.
  2. Arrange the datapoints in decreasing order of probability score.
  3. Let threshold 1 = probability score 1, threshold 2 = probability score 2, and so on.
  4. Add n columns to the table, one column per threshold.
  5. If the probability score of a datapoint is greater than the column's threshold value, the column value for that datapoint is 1, else 0.
  6. Compute the TPR and FPR for each column, i.e. for every threshold value (for example TPR1, FPR1 for the threshold 1 column, TPR2, FPR2 for the threshold 2 column, and so on).
  7. We get TPR1, TPR2, …, TPRn and FPR1, FPR2, …, FPRn from the above computation.
  8. Plot the TPRs on the y-axis and the FPRs on the x-axis.
  9. This forms the ROC curve, and the area under it is termed the AUC.

Fig 11 — AUC-ROC curve
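A compact sketch of the whole procedure using scikit-learn's roc_curve and roc_auc_score (the labels and scores are made up):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical labels and probability scores
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.2, 0.7])

# roc_curve sweeps the score values as thresholds, much like the steps
# above, and returns one (FPR, TPR) pair per threshold
fpr, tpr, thresholds = roc_curve(y_true, scores)

print(roc_auc_score(y_true, scores))   # area under the (FPR, TPR) curve
```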

Case I: 0.5 < AUC < 1

If the value of AUC for a model is greater than 0.5, it means the model is able to separate the two distinct classes; the higher the value, the better the separability.

Case II: AUC = 0.5

When the AUC is 0.5, we can conclude that our model is not at all capable of separating the two distinct classes of datapoints; in other words, our model is just a random model.

Case III: 0 < AUC < 0.5

This indicates that something went wrong while training the model. For example, if the AUC is 0.25, we can get an AUC of 0.75 with the same model just by swapping the class labels, i.e. swapping 0 and 1 in the model's output; the new AUC equals 1 - the old AUC.
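A tiny sketch of that swap (with invented scores for a deliberately backwards model):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.9, 0.8, 0.2, 0.1]   # a model that ranks the classes backwards

print(roc_auc_score(y_true, scores))                   # 0.0
print(roc_auc_score(y_true, [1 - s for s in scores]))  # 1.0 = 1 - old AUC
```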

Thanks for Reading.

I hope everyone now has a good intuition about classification-based performance metrics. There are also some regression-based performance metrics, which I will be covering in the next part of this article.

