KL Divergence — in layman’s terms

Ajit Rajasekharan
8 min read · Mar 23, 2019

If we were asked to look at the three animals below and say how much of a cat or a dog each one is, most of us would agree that

  • the first one is “all cat and no dog”
  • the second one is “more cat than dog”
  • the third one is “more dog than cat”

Images of animals are from Marvin’s review of The Illustrated Encyclopedia of Cat Breeds.

If we want a neural-net-based model to do the same thing (something these models have become good at in the last few years), the steps are roughly as follows:

  1. We first need to generate some training data ourselves, labeling each picture with probability assignments like the values shown above (e.g. 90% [0.9] cat; 10% [0.1] dog).
  2. We then have the model predict these values for each image in our training set, and let the model keep improving its predictions based on how far off they are from the values humans assigned.
  3. Once the model does this successfully for a large number of pictures, it is likely to make predictions that agree with our estimates even for images of cats and dogs it has never seen before.

Focusing specifically on the last part of the second step above, the question is: how does the model calculate how far off its predicted percentages of “catness/dogness” are for an image? This “how far off” measure could be computed in many ways. One approach is described below.

  • Simply adding up all the predicted values and comparing the total with the total of the human-labeled values won’t do, since the values assigned to any image always add up to 100% (if a picture is 90% dog, it is inevitably 10% cat; the same argument applies with more than two categories, such as cats, dogs, and cows).
  • However, if we compute a weighted sum of scores, where
      • the weighting is the estimated probability (either the human label or the model prediction) of “catness/dogness”, and
      • the score is a function (referred to below as score_function) that we will choose as simply some function of the estimate itself,
    then we can give a single numeric value for each picture that captures the distribution of our predictions for cats and dogs.
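
As a rough illustration of the two points above, here is a minimal Python sketch. The model's probability values, the helper names plain_sum and weighted_score, and the use of the natural logarithm as score_function are all assumptions made for this example. The text so far only says that the score is some function of the estimate itself.

```python
import math
from typing import Callable, Sequence


def plain_sum(estimates: Sequence[float]) -> float:
    """Add up the probabilities assigned to one picture."""
    return sum(estimates)


def weighted_score(
    estimates: Sequence[float],
    score_function: Callable[[float], float],
) -> float:
    """Weighted sum of scores: each estimate weights the score of that same estimate."""
    # Terms with zero weight are skipped: they contribute nothing to the sum,
    # and this avoids evaluating e.g. log(0) for an "all cat, no dog" picture.
    return sum(p * score_function(p) for p in estimates if p > 0)


# [cat, dog] probabilities for one picture.
human_label = [0.9, 0.1]       # the 90% cat / 10% dog example from the text
model_prediction = [0.6, 0.4]  # hypothetical model output, made up for illustration

# Both plain sums are 1 (up to floating-point rounding),
# so they cannot tell the two assignments apart.
print(plain_sum(human_label), plain_sum(model_prediction))

# Assumption: natural log as score_function, chosen only as an example.
human_value = weighted_score(human_label, math.log)
model_value = weighted_score(model_prediction, math.log)
print(human_value, model_value)  # two different numbers, one per distribution
```

With this choice, the human label and the model prediction each collapse to a single number, and the two numbers differ even though both lists of probabilities sum to 1.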