From Hartley to Shannon

Shannon asked: “How much information can a channel carry?”

He wanted a function that measures “how surprised will I be by the outcome?” A fair coin leaves you more surprised than a coin that always lands heads. He wrote down a few reasonable properties any such function should have and derived the only formula that satisfies them.

Start with the simplest case: s equally likely outcomes, each with probability 1/s.

Call the uncertainty A(s). We just need to know: what properties must A(s) have?

Property 1: Monotonicity. More choices = more uncertainty. So

A(s_1) < A(s_2) \quad \text{if} \quad s_1 < s_2

Property 2: Consistency. 

Imagine choosing 1 letter from an alphabet of s^m symbols. That should be equivalent to choosing m letters one at a time from an alphabet of s symbols. So:

A(s^m) = m \cdot A(s)
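A quick sanity check on that equivalence, counting outcomes both ways (s and m below are arbitrary example values, not anything fixed by the derivation):

```python
from itertools import product

# Choosing 1 symbol from an alphabet of s^m symbols offers exactly as many
# equally likely outcomes as choosing m symbols in a row from an alphabet
# of s symbols. s and m are arbitrary example values.
s, m = 3, 4
alphabet = range(s)
sequences = list(product(alphabet, repeat=m))  # all length-m strings over s symbols
assert len(sequences) == s ** m
print(f"{s}^{m} = {s ** m} outcomes either way")
```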

This single equation forces A(s) to be a logarithm. Here’s why:

If A(s) = k \log s, let’s try it:

A(s^m) = k \log(s^m) = k m \log s = m \cdot k \log s = m \cdot A(s)

It works! And it’s the only monotonic function that works. So entropy must be logarithmic.
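A numeric check that A(s) = k \log s really satisfies both properties (k = 1 and base-2 logs are arbitrary choices here; any positive constant and base work):

```python
import math

k = 1.0  # arbitrary positive constant; base-2 logs, so units are bits

def A(s):
    return k * math.log2(s)

# Consistency: A(s^m) = m * A(s)
for s in (2, 3, 10):
    for m in (1, 2, 5):
        assert math.isclose(A(s ** m), m * A(s))

# Monotonicity: more equally likely choices, more uncertainty
assert A(4) < A(8) < A(16)
print("A(s) = k log s passes both properties")
```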

Next: 

The goal is to show that A(t) = k \log t is forced for every integer t ≥ 2, not just numbers like 4, 8, 27 that happen to be perfect powers of some base.

The trick: for any integers s, t ≥ 2 and any n, you can always find an integer m such that

s^m \leq t^n < s^{m+1}

Taking logs:

m \log s \leq n \log t < (m+1) \log s

Dividing by n \log s:

\frac{m}{n} \leq \frac{\log t}{\log s} < \frac{m}{n} + \frac{1}{n}

As n \to \infty, the gap \frac{1}{n} \to 0, so \frac{\log t}{\log s} is pinned down exactly. The same squeeze applies to A: by monotonicity and A(s^m) = m A(s), the inequality s^m \leq t^n < s^{m+1} gives m A(s) \leq n A(t) < (m+1) A(s), so \frac{A(t)}{A(s)} is trapped in the same shrinking interval as \frac{\log t}{\log s}. Hence \frac{A(t)}{A(s)} = \frac{\log t}{\log s}, i.e. A(t) = k \log t with k = \frac{A(s)}{\log s}. A(t) has no wiggle room: it’s the unique solution.
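Here is the squeeze in action for one example pair (s = 2, t = 3 are arbitrary choices; any integers ≥ 2 behave the same way):

```python
import math

s, t = 2, 3  # arbitrary example values
target = math.log(t) / math.log(s)

for n in (1, 10, 100, 1000):
    m = math.floor(n * target)          # the m with s^m <= t^n < s^(m+1)
    assert s ** m <= t ** n < s ** (m + 1)
    lo, hi = m / n, (m + 1) / n
    assert lo <= target < hi
    print(f"n={n:4d}: {lo:.6f} <= log t / log s < {hi:.6f}")
```

The printed interval shrinks like 1/n, trapping log t / log s (≈ 1.585 for this pair) with no room to spare.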

Now the harder case: what if outcomes have different probabilities p_i?

We use a decision tree argument. Here’s the key idea:

Imagine n equally likely outcomes split into k groups of sizes n_1, n_2, \dots, n_k, where

n_1 + n_2 + \cdots + n_k = n

Two ways to count the uncertainty:

Way 1 — All at once:
uncertainty = k \log n (uniform over n outcomes)

Way 2 — In two stages:

• First, pick which group you’re in: uncertainty = H(p_1, \ldots, p_k)

• Then, pick within the group: uncertainty = k \log n_i for each group i, weighted by p_i

Setting them equal:

k \log n = H(p_1, \ldots, p_k) + k \sum p_i \log n_i

Rearranging:

H(p_1, \ldots, p_k) = k \log n - k \sum p_i \log n_i

Since

p_i = \frac{n_i}{n}

we substitute

\log n_i = \log(p_i n) = \log p_i + \log n

which gives, after the k \log n terms cancel (because \sum p_i = 1),

H(p_1, \ldots, p_k) = -k \sum p_i \log p_i

That’s Shannon entropy. The k is just a constant that depends on your choice of log base (base 2 gives bits, base e gives nats).
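Putting it together: a minimal entropy function (k = 1, base-2 logs, so bits), plus a numeric check of the two-stage counting identity above, using arbitrary example group sizes:

```python
import math

def entropy(ps):
    """Shannon entropy in bits (k = 1, base-2 logs)."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

# Two-stage identity: log n = H(p_1, ..., p_k) + sum_i p_i * log n_i,
# where p_i = n_i / n. Group sizes below are an arbitrary example.
groups = [1, 3, 4]                 # n_1, n_2, n_3
n = sum(groups)                    # 8 equally likely outcomes in total
ps = [ni / n for ni in groups]

lhs = math.log2(n)
rhs = entropy(ps) + sum(p * math.log2(ni) for p, ni in zip(ps, groups))
assert math.isclose(lhs, rhs)
print(f"H{tuple(ps)} = {entropy(ps):.4f} bits; identity holds")
```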

