MLE and MAP — in layman’s terms

Ajit Rajasekharan
Mar 23, 2019

If we have a six-sided die and no upfront information about it (for instance, whether it is biased), then

  • we can simply throw the die as many times as we choose (the more times we repeat the experiment, the better our estimate is likely to be)
  • and then estimate the probability of getting each face by counting the number of times that face came up and dividing by the number of times we performed the experiment. We do the same for all the other faces (strictly, we need to calculate this for only five of the faces; the last one is determined by those five values, since all six probabilities must add up to 1, the certainty that one of the faces occurs in each throw). So if we did this experiment 1000 times and got counts of 200, 100, 300, 80, 120, 200 (chosen as multiples of 10 for ease), then the estimated probability distribution is (.2, .1, .3, .08, .12, .2)
  • This is MLE (maximum likelihood estimation): estimating the probability distribution by choosing the values under which the experimental observations are most likely.
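The counting procedure above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function name `mle_estimate` is a made-up helper, and the counts are the ones from the example.

```python
def mle_estimate(counts):
    """Return the relative frequency of each face: count / total throws."""
    total = sum(counts)
    if total == 0:
        raise ValueError("need at least one throw to estimate anything")
    return [count / total for count in counts]


# Number of times each of the six faces came up in 1000 throws (from the example).
counts = [200, 100, 300, 80, 120, 200]

mle = mle_estimate(counts)
print([round(p, 2) for p in mle])  # [0.2, 0.1, 0.3, 0.08, 0.12, 0.2]
```

Note that the estimate uses nothing but the observed counts, which is why it depends entirely on how many throws we made.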

On the other hand, if we have some upfront knowledge that the die has a bias (e.g. a chipped or rounded edge that makes it land more often on a couple of faces), then we can factor that into our estimates of the probability of each face showing up.

  • For example, if we capture our knowledge of the bias in a prior distribution like (.1, .1, .3, .3, .1, .1), then we can factor that into our estimation.
  • So if we take the results from the experiment we did before (.2, .1, .3, .08, .12, .2), add the prior values (.1, .1, .3, .3, .1, .1), and divide the result by 2 so that they again add up to 1 (normalized), we have (.15, .1, .3, .19, .11, .15). Alternatively, we can imagine that the prior is worth 1000 extra throws, so that in total we did the experiment 2000 times and got results 300, 200, 600, 380, 220, 300, which again yields the same values (.15, .1, .3, .19, .11, .15). The prior could be any valid distribution (that is, one whose probabilities add up to 1); these numbers were chosen to keep the arithmetic simple.
  • This is MAP (maximum a posteriori estimation).
  • We can see from this example that MLE is just a special case of MAP where we assume up front that all outcomes are equally likely. However, there is a subtle difference: if we have very little data (that is, we conduct the experiment very few times), it is possible a particular face may never show up and we would…
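The "imagined throws" view of the prior can be sketched in Python as below. It mirrors the arithmetic of the example rather than a formal derivation of the posterior mode. The function name `estimate_with_prior` and the `prior_strength` parameter (how many imagined throws the prior is worth) are assumptions introduced for illustration.

```python
def estimate_with_prior(counts, prior, prior_strength):
    """Blend observed counts with a prior expressed as imagined throws.

    counts         -- observed number of throws per face
    prior          -- prior probability per face (expected to sum to 1)
    prior_strength -- hypothetical knob: how many imagined throws the prior is worth
    """
    # Turn the prior probabilities into imagined (pseudo) counts.
    pseudo_counts = [p * prior_strength for p in prior]
    # Add real and imagined counts face by face, then normalize.
    combined = [c + pc for c, pc in zip(counts, pseudo_counts)]
    total = sum(combined)
    return [c / total for c in combined]


counts = [200, 100, 300, 80, 120, 200]  # observed counts from the example
prior = [0.1, 0.1, 0.3, 0.3, 0.1, 0.1]  # prior belief about the bias

# A prior worth 1000 imagined throws carries the same weight as the 1000 real ones.
estimate = estimate_with_prior(counts, prior, prior_strength=1000)
print([round(p, 2) for p in estimate])  # [0.15, 0.1, 0.3, 0.19, 0.11, 0.15]
```

In this sketch, setting `prior_strength` to 0 drops the prior and returns the plain relative frequencies, while larger values pull the estimate further toward the prior.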