brain of mat kelcey


measuring baseline random performance for an N way classifier (#three_strikes_rule)

April 11, 2020

this post is part of my three-strikes-rule series; the third time someone asks me about something, i have to write it up

>>> import numpy as np
>>> from sklearn.metrics import classification_report, confusion_matrix

consider a 5 way classifier with varying levels of support per class; specifically 100 examples each of class0 and class1, and 20 examples each of class2, class3 and class4.

>>> training_data_support = [100, 100, 20, 20, 20]

what's the simplest way to measure the baseline performance of a random classifier? we often want to know this value to ensure we don't have silly bugs and/or that we are getting some signal from the data beyond random choice.

firstly let's expand the per class counts into a dense set of labels, one per example (since sklearn metrics expect a label for every example and don't work with a sparse set of counts)

>>> y_true = np.concatenate([np.repeat(i, n) for i, n in enumerate(training_data_support)])
>>> y_true

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4])

we can do random predictions proportional to the support in the training data

>>> training_data_proportions = training_data_support / np.sum(training_data_support)
>>> y_pred = np.random.choice(range(len(training_data_support)),
...                           p=training_data_proportions,
...                           size=sum(training_data_support))
>>> y_pred

array([1, 0, 0, 1, 1, 1, 0, 3, 1, 1, 3, 1, 1, 1, 2, 0, 1, 4, 1, 1, 1, 1,
       0, 2, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 4, 2, 1, 4, 0, 2, 1, 0, 1,
       1, 0, 1, 4, 0, 2, 0, 0, 1, 1, 0, 1, 2, 4, 3, 0, 1, 1, 2, 1, 2, 3,
       0, 0, 0, 2, 1, 0, 0, 0, 1, 1, 0, 0, 1, 3, 3, 1, 1, 1, 3, 1, 0, 1,
       0, 0, 1, 0, 1, 1, 4, 1, 3, 3, 1, 1, 1, 0, 0, 1, 0, 2, 1, 0, 1, 0,
       4, 0, 2, 0, 3, 1, 0, 1, 1, 2, 1, 1, 1, 3, 0, 2, 0, 0, 0, 1, 1, 0,
       2, 0, 0, 0, 1, 0, 0, 1, 2, 1, 1, 0, 1, 1, 4, 1, 0, 3, 2, 0, 2, 0,
       1, 3, 4, 1, 2, 0, 1, 0, 0, 1, 4, 0, 1, 3, 4, 1, 0, 1, 0, 1, 1, 4,
       0, 0, 3, 0, 1, 1, 1, 2, 2, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0,
       0, 4, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 3, 1, 1, 2, 1, 3,
       1, 1, 0, 1, 3, 0, 4, 3, 0, 2, 0, 0, 3, 4, 0, 1, 0, 4, 2, 1, 1, 1,
       0, 1, 0, 1, 2, 1, 3, 0, 2, 0, 1, 1, 1, 0, 0, 0, 1, 1])
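
note that the draw above is unseeded, so the exact predictions (and every metric derived from them) will differ from run to run. a minimal sketch of a repeatable version, assuming numpy's `default_rng` generator; the names `rng`, `y_pred_seeded` and `predicted_counts`, and the seed value, are my own choices for illustration and not part of the original snippet.

```python
import numpy as np

training_data_support = np.array([100, 100, 20, 20, 20])
training_data_proportions = training_data_support / training_data_support.sum()

# a seeded Generator makes the random baseline repeatable between runs
rng = np.random.default_rng(seed=0)  # the seed value 0 is an arbitrary choice
y_pred_seeded = rng.choice(len(training_data_support),
                           p=training_data_proportions,
                           size=int(training_data_support.sum()))

# how often each class was predicted; should be roughly proportional to support
predicted_counts = np.bincount(y_pred_seeded,
                               minlength=len(training_data_support))
print(predicted_counts)
```

the predicted counts won't match the support exactly, since each prediction is an independent draw, but they should land in the same ballpark.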

from this we can calculate standard metrics; if we can't beat these, we've done something realllllly wrong :/

>>> confusion_matrix(y_true, y_pred)

array([[29, 47,  9,  9,  6],
       [40, 36, 11,  6,  7],
       [ 7, 10,  1,  2,  0],
       [ 7,  5,  2,  3,  3],
       [ 7, 10,  2,  1,  0]])
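
since the predictions are drawn independently of the true labels, we can also work out what this matrix should look like on average, without sampling at all. a rough sketch under that independence assumption; `expected_confusion` and `expected_accuracy` are names introduced here for illustration.

```python
import numpy as np

training_data_support = np.array([100, 100, 20, 20, 20])
training_data_proportions = training_data_support / training_data_support.sum()

# predictions are drawn independently of the true label, so the
# expected count in cell (i, j) is support[i] * proportion[j]
expected_confusion = np.outer(training_data_support, training_data_proportions)
print(np.round(expected_confusion, 1))

# expected accuracy is the diagonal mass over the total,
# which works out to the sum of the squared class proportions
expected_accuracy = np.trace(expected_confusion) / training_data_support.sum()
print(round(expected_accuracy, 3))  # about 0.314
```

the first row comes out as roughly [38.5, 38.5, 7.7, 7.7, 7.7], so the sampled row of [29, 47, 9, 9, 6] is one noisy draw around that.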

>>> print(classification_report(y_true, y_pred))

                  precision    recall  f1-score   support
               0       0.32      0.29      0.31       100
               1       0.33      0.36      0.35       100
               2       0.04      0.05      0.04        20
               3       0.14      0.15      0.15        20
               4       0.00      0.00      0.00        20
        accuracy                           0.27       260
       macro avg       0.17      0.17      0.17       260
    weighted avg       0.27      0.27      0.27       260
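
keep in mind these numbers come from a single random draw over only 260 examples, so they are noisy. one way to get a feel for the spread is to repeat the draw many times and look at the distribution of accuracy. a minimal sketch, assuming numpy's `default_rng`; the seed, `n_trials` and the variable names are arbitrary choices of mine.

```python
import numpy as np

training_data_support = np.array([100, 100, 20, 20, 20])
training_data_proportions = training_data_support / training_data_support.sum()
y_true = np.repeat(np.arange(len(training_data_support)), training_data_support)

rng = np.random.default_rng(seed=0)  # arbitrary seed
n_trials = 1000                      # arbitrary number of repeats

# one row of 260 random predictions per trial
y_preds = rng.choice(len(training_data_support),
                     p=training_data_proportions,
                     size=(n_trials, len(y_true)))

# accuracy of each trial (y_true broadcasts across the rows), then summarise
accuracies = (y_preds == y_true).mean(axis=1)
print(accuracies.mean(), accuracies.std())
```

the mean should sit close to the sum of the squared class proportions (about 0.31 for this support), with individual trials wobbling a few points either side; the 0.27 above is within that wobble.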