If we have a six-faced die and no upfront information about it (that is, we don't know whether it is biased), then
- we can simply throw the die as many times as we choose (the more times we perform the experiment, the better our estimate is likely to be)
- and then estimate the probability of getting each face by counting the number of times that face came up and dividing by the number of throws. We do the same for all the other faces (actually we need to calculate this only for five of the faces; the last one is determined by those five values, since all six probabilities have to add up to 1, the certainty that one of the faces occurs in an experiment). So if we threw the die 1000 times and got the counts 200, 100, 300, 80, 120, 200 (chosen as multiples of 10 for ease), then the estimated probability distribution is .2, .1, .3, .08, .12, .2
- this is MLE (maximum likelihood estimation): estimating the probability distribution by choosing the values that best fit the experimental observations.
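The counting recipe above is just relative frequencies. A minimal sketch in Python, using the example counts from the text:

```python
# MLE for a six-faced die: the estimated probability of each face is
# its observed count divided by the total number of throws.
counts = [200, 100, 300, 80, 120, 200]  # the 1000-roll example above
total = sum(counts)

mle = [c / total for c in counts]
print(mle)  # [0.2, 0.1, 0.3, 0.08, 0.12, 0.2]
```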
On the other hand, if we have some upfront knowledge that the die is biased (say, it has a chipped or rounded edge that makes it land more often on a couple of faces), then we can factor that into our estimates of the probability of each face showing up.
- For example, if we capture our knowledge of the die's bias in a distribution like (.1, .1, .3, .3, .1, .1), then we can factor that into our estimation.
- So if we take the results from the experiment we did before (.2, .1, .3, .08, .12, .2), add the prior values (.1, .1, .3, .3, .1, .1), and divide the result by 2 so that everything again adds up nicely to 1 (normalized), we get (.15, .1, .3, .19, .11, .15). Alternatively, we can imagine we did the experiment 2000 times and got the counts 300, 200, 600, 380, 220, 300, which again yields the same values (.15, .1, .3, .19, .11, .15). The imagined prior could be any valid distribution (that is, any set of probability values that add up to 1); these numbers were chosen to keep the arithmetic simple.
- This is MAP (maximum a posteriori estimation).
- We can see from this example that MLE is just a special case of MAP where the prior treats all outcomes as equally likely. However, there is a subtle difference: if we have very little data (that is, we conduct the experiment very few times), a particular face may never show up, and MLE would assign it a probability of zero. This would clearly be incorrect, because if we used the learnt probabilities to predict future outcomes, we would be wrong every time that face came up. The problem is particularly bad when we are predicting multiple events occurring together: a single zero makes the entire joint probability zero. MAP doesn't have that issue, because we can choose a prior distribution that assigns a nonzero value to every outcome.
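The MAP arithmetic in the text (average the observed frequencies with the prior) and the zero-count problem can both be sketched in a few lines. The 5-roll counts in the second half are hypothetical, invented to show a face that never appears:

```python
# MAP estimate as described above: average the MLE frequencies with
# the prior (equivalently, add 1000 imagined prior rolls to the 1000
# real ones).
counts = [200, 100, 300, 80, 120, 200]
prior  = [0.1, 0.1, 0.3, 0.3, 0.1, 0.1]
n = sum(counts)

mle = [c / n for c in counts]
map_est = [(m + p) / 2 for m, p in zip(mle, prior)]
print([round(x, 2) for x in map_est])  # [0.15, 0.1, 0.3, 0.19, 0.11, 0.15]

# With very little data, some faces may never show up. MLE assigns
# them probability zero; MAP keeps every face nonzero via the prior.
few = [2, 1, 0, 0, 1, 1]  # hypothetical: only 5 rolls, faces 3 and 4 unseen
mle_few = [c / sum(few) for c in few]
map_few = [(m + p) / 2 for m, p in zip(mle_few, prior)]
print(any(p == 0 for p in mle_few), all(p > 0 for p in map_few))  # True True
```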
MLE and MAP are what common sense tells us to do:
- MLE — estimate the probabilities with data we have with no prior assumptions about the underlying probability distribution.
- MAP — If we have some prior knowledge we can factor it into our estimate of the underlying probability distribution along with what the data tells us about the distribution.
What can we use this for?
We can now use this estimated probability distribution to predict outcomes of future experiments (e.g. what is the probability of getting three sixes in 5 rolls?).
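For instance, taking the estimated probability of a six (0.2 from the 1000-roll example) and assuming independent rolls, the "three sixes in 5 rolls" question is a binomial calculation:

```python
from math import comb

# Probability of exactly three sixes in 5 independent rolls, using the
# estimated P(six) = 0.2 from the earlier experiment.
p_six = 0.2
n, k = 5, 3
p = comb(n, k) * p_six**k * (1 - p_six)**(n - k)
print(round(p, 4))  # C(5,3) * 0.2^3 * 0.8^2 = 0.0512
```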
Why is such an obvious concept such a big deal, and why does it show up so often in machine learning papers?
Imagine we are predicting the lifespan of a person, that is, whether he/she will live beyond 60 (a boolean output), using 50 parameters, each of which is also boolean (e.g. gender: male/female, alcohol/no alcohol, etc.). We clearly can't collect data, even in a theoretical sense, covering even one instance of each of the 2⁵⁰ possible combinations (about 10¹⁵), since there are only ~10 billion people to collect data from. Our estimate can only be based on a subset, even ignoring the practical impossibility of collecting 10 billion samples.
- So if we are going to estimate a probability distribution involving many variables, it can only be from a subset of the data.
- Estimating a probability distribution in real-world scenarios involving a large number of variables essentially means finding the parameters theta of a function F(X, theta) that best fits the available data. The parameters theta could be the parameters of a probability distribution or the learnt weights of a neural net.
- Most machine learning problems, and specifically neural net models (to which the current wave of success in machine learning can largely be attributed), can be viewed as function approximation, where the values assigned to the parameters (weights) of the network, based on the available data, are chosen using methods like MLE and MAP.
- In summary MLE and MAP try to find parameters of a function representing a probability distribution that best fits available data.
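The "find the parameters that best fit the data" view can be sketched with the simplest possible F: a biased coin with a single parameter theta. The flip data is hypothetical, and a tiny grid search stands in for the gradient-based optimizers used to fit neural nets:

```python
from math import log

# MLE as parameter fitting: choose the theta that maximizes the
# log-likelihood of the observed data.
flips = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # hypothetical: 7 heads in 10 flips

def log_likelihood(theta, data):
    # log P(data | theta) for independent Bernoulli(theta) flips
    return sum(log(theta) if x else log(1 - theta) for x in data)

grid = [i / 1000 for i in range(1, 1000)]  # candidate theta values
theta_mle = max(grid, key=lambda t: log_likelihood(t, flips))
print(theta_mle)  # 0.7, the closed-form MLE: heads / total flips
```

MAP would simply add a log-prior term to the quantity being maximized, pulling the estimate toward whatever theta the prior favours.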
- Machine Learning — this is an old lecture series from 2015, but the early sections cover MLE, MAP and other basic foundations of machine learning.
- http://www.math.uconn.edu/~kconrad/blurbs/analysis/entropypost.pdf This paper discusses potential choices of prior distribution based on our rough understanding of the underlying distribution. In practice, however, the choice of prior is often also driven by computational convenience.
Originally published at www.quora.com.