# Entropy of a probability distribution — in layman’s terms

Entropy of a probability distribution is the average “element of surprise” or amount of information when drawing from (or sampling) the probability distribution.

Let's consider a probability distribution composed of two outcomes: “the sun will rise tomorrow” (probability 1) and “the sun will not rise tomorrow” (probability 0). The values 1 and 0 are chosen just for illustrative purposes; a more realistic distribution would be something like .999999999 and .000000001, which still sum to 1.

There is no surprise if someone samples this distribution and tells us the outcome before sunrise every day. We already know what the outcome is, because we know the distribution to be of the form (1, 0): the sun has been rising every morning our entire lives and for the last ~5 billion years. So there is hardly any information in what they tell us.

However, if someone tells us tomorrow the sun will not rise, we might dismiss them immediately but only after a fleeting moment of surprise — the outcome was not what we expected given the distribution we have in our heads.

On average, the element of surprise when drawing from this distribution would be 0, given the values of the probabilities of the two possible outcomes.

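• Numerically, the entropy of a probability distribution is the negative of the probability-weighted average of the log of the probabilities: $H = -\sum_i p_i \log p_i$.
• So in the above case the entropy, or information content, would be $-(0 \cdot \log 0 + 1 \cdot \log 1) = 0$, using the usual convention that $0 \cdot \log 0 = 0$.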
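The formula above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original answer: the function names `surprisal` and `entropy` are made up for this sketch, and base 2 (bits) is an assumed choice of logarithm.

```python
import math

def surprisal(p, base=2):
    """Surprise (self-information) of a single outcome that has probability p."""
    return -math.log(p, base)

def entropy(probs, base=2):
    """Shannon entropy: the probability-weighted average surprisal.

    Outcomes with p == 0 are skipped, following the convention 0 * log 0 = 0.
    """
    return sum(p * surprisal(p, base) for p in probs if p > 0)

sunrise_certain = [1.0, 0.0]
sunrise_almost_certain = [0.999999999, 0.000000001]

print(entropy(sunrise_certain))          # no surprise on average
print(entropy(sunrise_almost_certain))   # tiny, but no longer exactly zero
print(surprisal(0.000000001))            # one very surprising outcome, about 29.9 bits
```

Note how the rare outcome is individually very surprising, yet it contributes almost nothing to the average because it is weighted by its tiny probability.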

Let's take another example, where two equally good teams in a sport are competing and each has a 50% chance of winning. In this case any draw from the distribution would be of great value, because the probability distribution alone does not tell us upfront who is likely to win. The information content or element of surprise of any draw from this distribution is high. For this reason, the average information content of this distribution (.5, .5) is higher than that of the previous distribution (1, 0).
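As a worked check, assuming logarithms to base 2 so that entropy is measured in bits (the text itself does not fix a base), the two distributions compare as follows:

$$
H(0.5,\, 0.5) = -\left(0.5 \log_2 0.5 + 0.5 \log_2 0.5\right) = -\left(0.5 \cdot (-1) + 0.5 \cdot (-1)\right) = 1 \text{ bit}
$$

$$
H(1,\, 0) = -\left(1 \cdot \log_2 1 + 0 \cdot \log_2 0\right) = 0 \text{ bits}
$$

Here $0 \cdot \log_2 0$ is taken to be 0 by convention.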

So there are two extreme cases of probability distributions over a finite set of N events, with everything else in between:

• One event has probability 1 and the others 0. In this case we have complete knowledge of what will happen given this distribution. So the average element of surprise of any draw from the distribution, or the average information content of the distribution, is 0.
• Every event is equally likely, so each event has probability 1/N. We have the least amount of knowledge of what will happen, even given the distribution, and so any draw from the distribution has an element of surprise or information value. The entropy is then the highest possible, log N. Any other probability distribution over the N events has entropy less than this maximum of log N.
• Between these two extremes lie all the other probability distributions, whose average element of surprise is within the range (0, log N).
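The two extremes and the range between them can be checked numerically. The sketch below is illustrative only: `random_distribution` is a hypothetical helper written for this example, N = 6 is an arbitrary choice, and the natural logarithm is assumed so that the maximum is log N in nats.

```python
import math
import random

def entropy(probs):
    """Shannon entropy in nats; zero-probability outcomes contribute 0."""
    return sum(-p * math.log(p) for p in probs if p > 0)

def random_distribution(n, rng):
    """Hypothetical helper: n random positive weights, normalized to sum to 1."""
    weights = [rng.random() + 1e-12 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

N = 6
rng = random.Random(0)           # fixed seed so the run is repeatable

one_hot = [1.0] + [0.0] * (N - 1)   # one event certain
uniform = [1.0 / N] * N             # all events equally likely
max_entropy = math.log(N)

samples = [entropy(random_distribution(N, rng)) for _ in range(1000)]

print(entropy(one_hot), entropy(uniform), max_entropy)
print(min(samples), max(samples))   # every sampled entropy lies between the extremes
```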

Is this interpretation of entropy as information content/element of surprise in contradiction to a layperson's usage of information content?

• In the competition example between equally good teams, the statement “there is a 50–50 chance either team would win” describes the distribution and seems to have “no information content” from a layperson's perspective.
• However, a draw from that distribution (team X won or team Y won) carries a high amount of information, given the uncertainty in the distribution.
• A layman's usage, when applied to a draw from the distribution, agrees with the interpretation of entropy as information content. It contradicts the entropy view of information content only when it is applied to the distribution itself, e.g. “50–50 chance either team would win”.
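The distinction between a single draw and the distribution can be made concrete by sampling. In this sketch (an illustration under assumptions, with made-up function names and a made-up lopsided distribution (0.9, 0.1) for contrast), each draw carries a surprisal of −log2 p, and the entropy of the distribution is what those surprisals average out to over many draws.

```python
import math
import random

def entropy_bits(probs):
    """Shannon entropy in bits."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def average_surprisal_bits(probs, n_draws, rng):
    """Draw outcomes from probs and average the surprisal -log2(p) of each draw."""
    outcomes = rng.choices(range(len(probs)), weights=probs, k=n_draws)
    return sum(-math.log2(probs[i]) for i in outcomes) / n_draws

rng = random.Random(42)          # fixed seed so the run is repeatable

evenly_matched = [0.5, 0.5]      # the two equally good teams
lopsided = [0.9, 0.1]            # assumed contrast case: a strong favourite

for name, probs in [("evenly matched", evenly_matched), ("lopsided", lopsided)]:
    sampled = average_surprisal_bits(probs, 100_000, rng)
    print(name, "entropy:", entropy_bits(probs), "sampled average:", sampled)
```

For the evenly matched teams every draw is worth exactly 1 bit, so the average is 1 bit; for the lopsided match most draws are unsurprising and the average is lower.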

How does interpretation of entropy as information content/element of surprise reconcile with the concept of entropy in a thermodynamics context?

• Imagine we introduce a single molecule of a gas in the corner of a box and are asked to predict where it will be over time. We could model the position of the molecule with a probability distribution over locations in the box at each instant, for example one distribution at time t = 0 and another at time t = 30.
• At time t = 0, the distribution resembles the (1, 0) distribution we saw earlier: there is no surprise about where the molecule will be when we draw from it. The average information content is low, so this is a low entropy distribution.
• At time t = 30, the molecule has diffused into the chamber, and we model its position as a uniform distribution. Any draw from this distribution has a high element of surprise, since we don't know where the molecule will be. This is a high entropy distribution.
• Again from a layman's perspective, at time t = 0 there seems to be “order” (high information: the position of the molecule is known to high accuracy) and at time t = 30 the order is gone: the molecule could be anywhere (disorder, loosely “no information content”). This is because these statements loosely describe the distribution itself. From an entropy perspective, the information content is the reverse: lowest at t = 0 and highest at t = 30.
• The information content is lowest at time t = 0, since given the distribution any draw is almost certain to find the molecule in the corner.
• At time t = 30, a draw can find the molecule anywhere in the box. So the information content of the draw is high, given the uncertainty present in the uniform distribution it is drawn from. We do not know just from the distribution where the molecule is; only a draw reveals it. Given the uniform distribution, any draw is surprising, so its entropy on average is the highest.
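The molecule example can be sketched as a toy simulation. Everything here is an assumption made for illustration: the box is reduced to a row of 8 cells, the molecule follows a lazy random walk (stay with probability 1/2, try to step left or right with probability 1/4 each, and stay put if the step would hit a wall), and the time steps are arbitrary units rather than the t = 30 of the text. The code tracks the probability distribution over cells and its entropy at each step.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability cells contribute 0."""
    return sum(-p * math.log(p) for p in probs if p > 0)

def diffuse_one_step(probs):
    """Advance the position distribution by one step of the lazy random walk."""
    n = len(probs)
    new = [0.0] * n
    for i, p in enumerate(probs):
        new[i] += 0.5 * p                        # molecule stays where it is
        left = i - 1 if i > 0 else i             # a wall on the left blocks the move
        right = i + 1 if i < n - 1 else i        # a wall on the right blocks the move
        new[left] += 0.25 * p
        new[right] += 0.25 * p
    return new

CELLS = 8
probs = [1.0] + [0.0] * (CELLS - 1)   # t = 0: molecule is certainly in the corner cell
history = [entropy(probs)]
for _ in range(200):
    probs = diffuse_one_step(probs)
    history.append(entropy(probs))

print("entropy at start:", history[0])
print("entropy at end:  ", history[-1])
print("maximum, log N:  ", math.log(CELLS))
```

Under these assumptions the entropy starts at 0, never decreases from one step to the next, and approaches log N as the distribution spreads toward uniform.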

References

This is a nice paper on the relationship between probability distributions and maximum entropy based on the Shannon definition of entropy http://www.math.uconn.edu/~kconr...

This paper by Keith Conrad uses the Shannon definition of entropy to illustrate a useful concept: the principle of maximum entropy. The question it answers is how we choose a probability distribution for some “phenomenon/experiment” whose predictions are least surprising. That is,

• If we are observing a phenomenon where all possible outcomes are equally likely, then the probability distribution over the outcomes is simply the uniform distribution, which has maximum entropy.
• However, if we observe or learn that the outcomes are not uniform, say one outcome is more likely than the other (e.g. a biased coin), the assignment of probabilities needs to reflect this. The “ideal” probability distribution in this case is the one that satisfies those constraints and, among all distributions that do, has the highest possible entropy. Such a distribution would be the least surprising in terms of the predictions it makes.
• If we instead ignore the constraints and conservatively choose the distribution with the highest entropy overall (e.g. the uniform distribution), it may be wasteful. For example, if we know a message is mostly composed of just 3 of 5 letters, then assigning a code of the same length to all 5 would be a wasteful use of the communication channel.
• On the other hand, choosing a distribution with smaller entropy that satisfies the constraints would say something stronger than what we are assuming.

His paper illustrates three probability distributions with maximum entropy for three different cases:

• The uniform distribution, for a finite set of outcomes with no further constraints.
• The Gaussian, for continuous probability distributions on the real numbers with known variance. The Gaussian is distinguished among all continuous probability distributions as the one with the highest entropy when the variance is known (the mean is absent from the final formula: all Gaussians with the same variance have the same entropy). Intuitively, the symmetric shape of the Gaussian, with the spread of the bump governed by the variance regardless of where it is centered, perhaps makes it the distribution with the highest entropy among all continuous distributions with that variance. The known variance plays the role of the constraint here.
• The exponential distribution, for an experiment with positive outcomes whose mean is known.
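The last two cases can be spot-checked with standard closed-form (differential) entropy formulas. This sketch is not taken from the paper: the comparison distributions (uniform and Laplace for the fixed-variance case, uniform on [0, 2·mean] for the fixed-mean case) are chosen here only as examples, and comparing against a few alternatives illustrates the claim rather than proving it.

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy (nats) of a Gaussian with standard deviation sigma."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def uniform_entropy_with_std(sigma):
    """Uniform on an interval of width w has variance w^2 / 12 and entropy log(w)."""
    width = sigma * math.sqrt(12)
    return math.log(width)

def laplace_entropy_with_std(sigma):
    """Laplace with scale b has variance 2 * b^2 and entropy 1 + log(2 * b)."""
    scale = sigma / math.sqrt(2)
    return 1 + math.log(2 * scale)

def exponential_entropy(mean):
    """Exponential distribution with the given mean has entropy 1 + log(mean)."""
    return 1 + math.log(mean)

def uniform_entropy_with_mean(mean):
    """Uniform on [0, 2 * mean] has the given mean and entropy log(2 * mean)."""
    return math.log(2 * mean)

sigma = 1.0
print("fixed variance:", gaussian_entropy(sigma),
      laplace_entropy_with_std(sigma), uniform_entropy_with_std(sigma))

mean = 1.0
print("fixed mean, positive outcomes:", exponential_entropy(mean),
      uniform_entropy_with_mean(mean))
```

The Gaussian formula depends only on sigma, which matches the remark that the mean does not appear in the final formula.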

Originally published at www.quora.com.

Ajit Rajasekharan, machine learning practitioner