initing the biases in a classifier to closer match training data
February 27, 2020 at 12:00 PM | categories: short_tute, three_strikes_rule
this post is part of my three-strikes-rule series; the third time someone asks me about something, i have to write it up
if we start with a simple model and run some data through it before it's trained we get, very roughly, a uniform distribution of outputs. this is because the layers are set up to output values around zero by default and the biases of the dense layers are init'd as zero.
>>> import numpy as np
>>> from tensorflow.keras.layers import *
>>> from tensorflow.keras.models import Model
>>>
>>> inp = Input(shape=(4,), name='input')
>>> out = Dense(units=5, activation='softmax')(inp)
>>>
>>> model = Model(inputs=inp, outputs=out)
>>>
>>> X = np.random.random(size=(16, 4)).astype(np.float32)
>>> model(X).numpy()
array([[0.27802917, 0.21119164, 0.13478227, 0.13102761, 0.24496922],
       [0.24828927, 0.2713407 , 0.17112398, 0.17032027, 0.13892575],
       [0.15432678, 0.37243405, 0.18875667, 0.18821105, 0.09627141],
       [0.22438918, 0.3494182 , 0.16165043, 0.1623524 , 0.10218978],
       [0.36323798, 0.1736943 , 0.11997112, 0.11338428, 0.22971232],
       [0.21113336, 0.2970533 , 0.19298542, 0.18329948, 0.11552842],
       [0.18894033, 0.3023035 , 0.17413773, 0.17320155, 0.16141686],
       [0.25081626, 0.24333045, 0.17587633, 0.17023188, 0.15974507],
       [0.28999767, 0.2593622 , 0.13038006, 0.13167746, 0.18858269],
       [0.28792346, 0.21455322, 0.17145245, 0.1603238 , 0.16574705],
       [0.3028727 , 0.20455441, 0.11584676, 0.11711189, 0.2596142 ],
       [0.19915669, 0.34796396, 0.1685621 , 0.17137761, 0.1129396 ],
       [0.36104262, 0.20813489, 0.12217646, 0.11595868, 0.1926873 ],
       [0.34652707, 0.20086858, 0.13397619, 0.12669212, 0.191936 ],
       [0.31914416, 0.1853605 , 0.1266775 , 0.12235387, 0.24646394],
       [0.21416464, 0.29376236, 0.17110644, 0.17128557, 0.14968103]], dtype=float32)
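one rough way to eyeball this is to average the predicted probabilities over the batch; with the biases at zero each class should land somewhere near 1/5, though with random weights and only 16 examples the spread is pretty wide.
>>> # average probability assigned to each class across the batch;
>>> # expect values very roughly around 1/num_classes = 0.2
>>> mean_per_class = model(X).numpy().mean(axis=0)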
but if we know the expected distribution of class labels (e.g. from training data) we can seed the bias values in the classifier layer (out) to have it reproduce these proportions at the start of training. we do this with a bias_initializer. this can sometimes speed training up.
>>> from scipy.special import logit
>>>
>>> def observed_proportion_logits(shape, dtype=None):
...     # assume the following counts of labels in the training data
...     class_counts = np.array([10, 5, 5, 100, 100])
...     # normalise them and return as logits
...     class_proportions = class_counts / np.sum(class_counts)
...     return logit(class_proportions)
...
>>> inp = Input(shape=(4,), name='input')
>>> out = Dense(units=5,
...             bias_initializer=observed_proportion_logits,
...             activation='softmax')(inp)
>>>
>>> model = Model(inputs=inp, outputs=out)
>>>
>>> X = np.random.random(size=(16, 4)).astype(np.float32)
>>> np.around(model(X), decimals=3)
array([[0.024, 0.024, 0.02 , 0.404, 0.528],
       [0.032, 0.027, 0.015, 0.326, 0.6 ],
       [0.025, 0.017, 0.014, 0.404, 0.54 ],
       [0.028, 0.021, 0.019, 0.348, 0.585],
       [0.037, 0.019, 0.023, 0.419, 0.502],
       [0.029, 0.019, 0.025, 0.445, 0.481],
       [0.033, 0.026, 0.024, 0.335, 0.582],
       [0.016, 0.02 , 0.013, 0.413, 0.537],
       [0.027, 0.016, 0.022, 0.501, 0.434],
       [0.037, 0.026, 0.027, 0.432, 0.478],
       [0.022, 0.017, 0.025, 0.532, 0.404],
       [0.028, 0.025, 0.021, 0.433, 0.493],
       [0.019, 0.013, 0.017, 0.48 , 0.471],
       [0.035, 0.021, 0.023, 0.397, 0.524],
       [0.036, 0.016, 0.014, 0.294, 0.64 ],
       [0.015, 0.014, 0.018, 0.45 , 0.504]], dtype=float32)
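in practice the counts wouldn't be hard coded; they'd come from the training labels themselves. here's a minimal sketch of that, assuming y_train is a 1d array of integer class labels (the random y_train below is just a stand-in), along with an alternative to the custom initializer: build the model first, then overwrite the classifier layer's bias variable directly.
>>> # stand-in for real labels; in practice this is the training set's y
>>> y_train = np.random.choice(5, size=1000, p=[0.05, 0.02, 0.02, 0.45, 0.46])
>>> class_counts = np.bincount(y_train, minlength=5)
>>> class_proportions = class_counts / np.sum(class_counts)
>>>
>>> # alternative to bias_initializer: assign the bias after the model is built
>>> # (layers[-1] is the softmax Dense layer from the model above)
>>> _ = model.layers[-1].bias.assign(logit(class_proportions).astype(np.float32))
as an aside, np.log(class_proportions) would work just as well as logit for the seed; softmax of log-probabilities gives back the proportions exactly (ignoring the contribution from the weights).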