brain of mat kelcey


initing the biases in a classifier to closer match training data (#three_strikes_rule)

February 27, 2020

this post is part of my three-strikes-rule series; the third time someone asks me about something, i have to write it up.

if we start with a simple model and run some data through it before it's trained we get, very roughly, a uniform distribution of outputs. this is because the layers are set up to generally output values around zero and the biases of the dense layers are initialised to zero.

>>> import numpy as np
>>> from tensorflow.keras.layers import *
>>> from tensorflow.keras.models import Model
>>>
>>> inp = Input(shape=(4,), name='input')
>>> out = Dense(units=5, activation='softmax')(inp)
>>>
>>> model = Model(inputs=inp, outputs=out)
>>>
>>> X = np.random.random(size=(16, 4)).astype(np.float32)
>>> model(X)

array([[0.27802917, 0.21119164, 0.13478227, 0.13102761, 0.24496922],
       [0.24828927, 0.2713407 , 0.17112398, 0.17032027, 0.13892575],
       [0.15432678, 0.37243405, 0.18875667, 0.18821105, 0.09627141],
       [0.22438918, 0.3494182 , 0.16165043, 0.1623524 , 0.10218978],
       [0.36323798, 0.1736943 , 0.11997112, 0.11338428, 0.22971232],
       [0.21113336, 0.2970533 , 0.19298542, 0.18329948, 0.11552842],
       [0.18894033, 0.3023035 , 0.17413773, 0.17320155, 0.16141686],
       [0.25081626, 0.24333045, 0.17587633, 0.17023188, 0.15974507],
       [0.28999767, 0.2593622 , 0.13038006, 0.13167746, 0.18858269],
       [0.28792346, 0.21455322, 0.17145245, 0.1603238 , 0.16574705],
       [0.3028727 , 0.20455441, 0.11584676, 0.11711189, 0.2596142 ],
       [0.19915669, 0.34796396, 0.1685621 , 0.17137761, 0.1129396 ],
       [0.36104262, 0.20813489, 0.12217646, 0.11595868, 0.1926873 ],
       [0.34652707, 0.20086858, 0.13397619, 0.12669212, 0.191936  ],
       [0.31914416, 0.1853605 , 0.1266775 , 0.12235387, 0.24646394],
       [0.21416464, 0.29376236, 0.17110644, 0.17128557, 0.14968103]], dtype=float32)
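(as an aside, this near-uniform behaviour doesn't depend on keras at all. here's a numpy-only sketch of the same effect, averaging a fresh random-weights / zero-bias layer over many draws; the draw counts, batch size and standard-normal kernel are just illustrative stand-ins for a real kernel init, not taken from the model above.)

```python
import numpy as np
from scipy.special import softmax

rng = np.random.default_rng(0)

# average the per-class output of a freshly init'd dense layer
# (random weights, zero bias) over many random draws
n_classes = 5
n_draws = 200
mean_probs = np.zeros(n_classes)
for _ in range(n_draws):
    W = rng.normal(size=(4, n_classes))   # stand-in for a fresh kernel init
    X = rng.random(size=(64, 4))          # batch of random inputs
    mean_probs += softmax(X @ W, axis=-1).mean(axis=0)
mean_probs /= n_draws

print(np.around(mean_probs, 3))  # each entry sits near 1/5
```

by symmetry no class is favoured, so the averaged outputs hover around 1/5 each.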

but if we know the expected distribution of class labels (e.g. from training data) we can seed the bias values in the classifier layer (out) to have it reproduce these proportions at the start of training.

we do this with a bias_initializer.

this can sometimes speed up the early stages of training, since the model doesn't have to spend its first updates just learning the base rates of the classes.

>>> from scipy.special import logit
>>>
>>> def observed_proportion_logits(shape, dtype=None, partition_info=None):
>>>     # assume following counts of labels in training data
>>>     class_counts = np.array([10, 5, 5, 100, 100])
>>>     # normalise them and return as logits
>>>     class_proportions = class_counts / np.sum(class_counts)
>>>     return logit(class_proportions)
>>>
>>> inp = Input(shape=(4,), name='input')
>>> out = Dense(units=5,
>>>             bias_initializer=observed_proportion_logits,
>>>             activation='softmax')(inp)
>>>
>>> model = Model(inputs=inp, outputs=out)
>>>
>>> X = np.random.random(size=(16, 4)).astype(np.float32)
>>> np.around(model(X), decimals=3)

array([[0.024, 0.024, 0.02 , 0.404, 0.528],
       [0.032, 0.027, 0.015, 0.326, 0.6  ],
       [0.025, 0.017, 0.014, 0.404, 0.54 ],
       [0.028, 0.021, 0.019, 0.348, 0.585],
       [0.037, 0.019, 0.023, 0.419, 0.502],
       [0.029, 0.019, 0.025, 0.445, 0.481],
       [0.033, 0.026, 0.024, 0.335, 0.582],
       [0.016, 0.02 , 0.013, 0.413, 0.537],
       [0.027, 0.016, 0.022, 0.501, 0.434],
       [0.037, 0.026, 0.027, 0.432, 0.478],
       [0.022, 0.017, 0.025, 0.532, 0.404],
       [0.028, 0.025, 0.021, 0.433, 0.493],
       [0.019, 0.013, 0.017, 0.48 , 0.471],
       [0.035, 0.021, 0.023, 0.397, 0.524],
       [0.036, 0.016, 0.014, 0.294, 0.64 ],
       [0.015, 0.014, 0.018, 0.45 , 0.504]], dtype=float32)
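as a quick sanity check on what the bias alone contributes: with an all-zero kernel the layer just outputs softmax(bias), so we can compute softmax(logit(p)) directly without keras. (worth noting: softmax(logit(p)) is only approximately p; softmax(np.log(p)) would recover the proportions exactly. either way it's close enough to put the model in the right ballpark at the start of training.)

```python
import numpy as np
from scipy.special import logit, softmax

class_counts = np.array([10, 5, 5, 100, 100])
p = class_counts / np.sum(class_counts)

# what the dense layer outputs when the kernel is all zeros:
# just the softmax of whatever we init'd the bias to
from_logit_bias = softmax(logit(p))
from_log_bias = softmax(np.log(p))    # recovers p exactly

print(np.around(p, 3))                # [0.045 0.023 0.023 0.455 0.455]
print(np.around(from_logit_bias, 3))  # close to p, but not exact
print(np.around(from_log_bias, 3))    # matches p
```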