Thursday, December 18, 2025

Shannon's Entropy from the ground up

I'm diverging a bit today. My friend Beli showed me this proof, and for the first time I understood Shannon's entropy from the ground up, so I wanted to share it.

Shannon’s entropy is usually presented as a formula,

H(p_1,\dots,p_n) = -\sum_{i=1}^n p_i \log p_i,

but the real question is why this expression is forced on us. The answer comes from thinking carefully about uncertainty and how it behaves when decisions are broken into stages.
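
Before getting to the why, here is a quick numerical sanity check of the formula in Python. The example distributions and the choice of log base 2 (so the unit is bits) are mine, not part of the proof.

import math

def shannon_entropy(p, base=2):
    # H(p_1, ..., p_n) = -sum_i p_i log p_i, in units set by the log base
    return -sum(p_i * math.log(p_i, base) for p_i in p if p_i > 0)

print(shannon_entropy([0.5, 0.5]))   # a fair coin: exactly 1 bit
print(shannon_entropy([0.9, 0.1]))   # a biased coin: about 0.469 bits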

Consider a situation with n equally likely outcomes. The uncertainty associated with this situation depends only on n; call this quantity A(n). If we make two independent choices, one among m equally likely options and one among n, the combined choice has mn equally likely outcomes, and the uncertainties should add. That gives the functional equation

A(mn)=A(m)+A(n).

Under a mild regularity assumption (say, monotonicity in n), the only solutions of this functional equation are logarithms, so for equally likely outcomes we must have

A(n) = k \log n

for some constant k, which sets the choice of units.
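
To make the additivity concrete, here is a small numerical check of my own; the values of m and n are arbitrary, and I take k = 1/ln 2 so the unit is bits.

import math

def A(n):
    # uncertainty of n equally likely outcomes; k = 1/ln 2, so the unit is bits
    return math.log2(n)

m, n = 8, 5
print(A(m * n))       # 5.3219...
print(A(m) + A(n))    # 5.3219..., additivity across independent choices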

Now consider a general probability distribution (p_1,\dots,p_n). To connect this case to the equal-probability case, approximate each probability by a rational number. Write

p_i = \frac{n_i}{N},

where the integers n_i sum to N.
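
As a concrete instance (my own example): take p = (1/2, 1/3, 1/6) and N = 6, so n_1 = 3, n_2 = 2, n_3 = 1.

from fractions import Fraction

p = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
N = 6                        # a common denominator for all three probabilities
n = [p_i * N for p_i in p]   # n_i = p_i * N
print(n)                     # 3, 2, 1 (as Fractions)
print(sum(n) == N)           # True: the n_i sum to N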

Imagine a two-stage decision process. In the first stage, we choose one of the n groups, where group i has probability p_i. In the second stage, once group i is chosen, we choose uniformly among its n_i equally likely outcomes.

Viewed as a single process, this is equivalent to choosing uniformly among N outcomes. The total uncertainty is therefore

A(N) = k \log N.
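
Sticking with my running example, the equivalence is easy to check: each leaf of the two-stage tree is reached with probability p_i * (1/n_i) = 1/N, which is exactly the uniform distribution on N outcomes.

from fractions import Fraction

p = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
n = [3, 2, 1]
N = 6

# Each leaf of the two-stage tree is reached with probability p_i * (1 / n_i).
leaves = [p_i / n_i for p_i, n_i in zip(p, n) for _ in range(n_i)]
print(len(leaves))                                 # 6 leaves in total
print(all(x == Fraction(1, N) for x in leaves))    # True: every leaf has probability 1/N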

Viewed in two stages, the total uncertainty is the uncertainty of the first choice plus the expected uncertainty of the second choice. The first stage contributes H(p_1,\dots,p_n). The second stage contributes

\sum_{i=1}^n p_i A(n_i) = \sum_{i=1}^n p_i \, k \log n_i.

Consistency requires these two descriptions to agree, so

k \log N = H(p_1,\dots,p_n) + \sum_{i=1}^n p_i \, k \log n_i.

Rearranging gives

H(p_1,\dots,p_n) = k \log N - k \sum_{i=1}^n p_i \log n_i.
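
Plugging the running example into this rearranged expression (again with k = 1/ln 2, so the answer comes out in bits):

import math

p = [1/2, 1/3, 1/6]
n = [3, 2, 1]
N = 6

# H = k log N - k * sum_i p_i log n_i, with k chosen so the unit is bits (log base 2)
H_two_stage = math.log2(N) - sum(p_i * math.log2(n_i) for p_i, n_i in zip(p, n))
print(H_two_stage)   # about 1.459 bits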

Substitute n_i = p_i N. Then

\log n_i = \log p_i + \log N,

and so

\sum_{i=1}^n p_i \log n_i = \sum_{i=1}^n p_i \log p_i + \log N \sum_{i=1}^n p_i.

Since the probabilities sum to one, this becomes

\sum_{i=1}^n p_i \log n_i = \sum_{i=1}^n p_i \log p_i + \log N.

Substituting back,

H(p_1,\dots,p_n) = k \log N - k\left(\sum_{i=1}^n p_i \log p_i + \log N\right) = -k \sum_{i=1}^n p_i \log p_i.

This is Shannon’s entropy formula.
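
One last numerical check on my running example: the closed form agrees with the two-stage expression computed above.

import math

p = [1/2, 1/3, 1/6]
n = [3, 2, 1]
N = 6

H_closed_form = -sum(p_i * math.log2(p_i) for p_i in p)
H_two_stage = math.log2(N) - sum(p_i * math.log2(n_i) for p_i, n_i in zip(p, n))
print(H_closed_form)                                # about 1.459 bits
print(abs(H_closed_form - H_two_stage) < 1e-12)     # True: the two computations agree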

What matters is not the algebra but the structure of the argument. Uncertainty must be additive across independent choices. It must reduce to a logarithm in the equally likely case. And it must behave consistently when a decision is broken into stages. These requirements leave no freedom. The entropy formula is not chosen. It is forced.

