Fano's inequality is a powerful tool from information theory that is typically used to prove lower bounds for minimax estimation. But it has also been used in algorithm design to prove information-theoretic lower bounds on data structure size or even dimensionality of an embedding.
The method works (like many lower bound arguments do) via algorithm design. Suppose, for example, you wanted to prove a lower bound on the size of a data structure for a lookup problem. You do this in two steps: first, you demonstrate a procedure that can reconstruct a random set of inputs from the data structure with reasonable probability. Then you invoke Fano's inequality to argue that because you can do this, the mutual information of the "channel" that builds the data structure must be large, and so the number of bits it uses must be large.
By Yao's minimax principle, this also gives you a lower bound against randomized constructions. It's also powerful because it doesn't even care how the data structure is built: all of that is absorbed into the black-box channel.
Unfortunately, many of the standard explanations of Fano's theorem are information-theory- or statistics-focused. What I'll try to do here is a more "algorithmic" rendition of the method.
As with most things, our story starts with pigeons. The pigeonhole principle, to be exact.
Claim: To encode an element of an $n$-element set so that it can always be recovered exactly, we need at least $\log n$ bits.
Proof: Suppose we used $b < \log n$ bits to encode an element of the set. Since there are only $2^b < n$ possible $b$-bit strings, the pigeonhole principle says some string must encode two distinct elements, and then we cannot decode which of the two was presented to us.
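If you prefer code to pigeons, here's a tiny Python sketch of the same fact (the particular encoder below is just one arbitrary choice): with $b < \log_2 n$ bits, some two elements must share a codeword.

```python
def try_to_encode(items, b):
    """Give each item a b-bit codeword and report the first clash, if any."""
    codeword = {}
    for x in items:
        c = hash(x) % (1 << b)        # any fixed b-bit encoder will do; this is one choice
        if c in codeword:
            return codeword[c], x     # two items share a codeword: decoding is ambiguous
        codeword[c] = x
    return None

n, b = 8, 2                            # b = 2 < log2(8) = 3 bits
print(try_to_encode(range(n), b))      # some pair always collides, by the pigeonhole principle
```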
We can generalize this further.
Let $X$ be a random variable taking the value $i \in [n]$ with probability $p_i$. Then (by Shannon's source coding theorem) the average number of bits needed to encode $X$ is
$$ H(X) = \sum_i p_i \log \frac{1}{p_i} $$
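To make the formula concrete, here's a quick Python version (just a sanity check, not part of the argument): a uniform distribution over $8$ values needs the full $3$ bits on average, while a skewed one needs fewer.

```python
import math

def entropy(p):
    """Shannon entropy H(X) = sum_i p_i * log2(1/p_i), in bits."""
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

uniform = [1/8] * 8
skewed  = [1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 1/128]

print(entropy(uniform))   # 3.0 bits: uniform over 8 values needs log2(8) bits on average
print(entropy(skewed))    # ~1.98 bits: a skewed source can be compressed below log2(n)
```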
The Pigeonhole Principle and Shannon's theorem are absolute statements. Either you use sufficient space or you will make mistakes.
But how many mistakes? In other words, if I need $H(X)$ bits on average, and I use instead $b \ll H(X)$ bits, how horrible will my error be when attempting to reconstruct $X$ (expressed as a function of $b, X$)?
This is where Fano's inequality appears. Let $h(x) = -x \log x - (1-x)\log(1-x)$ denote the binary entropy function.
Fano's inequality: Let $X$ be a random variable with range $M$. Let $\hat{X} = g(Y)$ be the predicted value of $X$ given some transmitted value $Y$, where $g$ is a deterministic function. Let $p_e$ be the probability of making a mistake, i.e., $p_e = \Pr[\hat{X} \ne X]$. Then
$$ p_e \ge \frac{H(X|Y) - 1}{\log |M|} $$
or more usefully,
$$ h(p_e) + p_e \log(|M|-1) \ge H(X|Y) $$
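To get a feel for the numbers, here's the weaker form as a small Python helper (the values plugged in below are arbitrary): if $Y$ still leaves $6$ bits of uncertainty about an $X$ ranging over $2^{10}$ values, any estimator errs at least half the time.

```python
import math

def fano_error_lower_bound(H_X_given_Y, M):
    """Weaker form of Fano's inequality: any estimator g(Y) of X errs
    with probability at least (H(X|Y) - 1) / log2(|M|)."""
    return max(0.0, (H_X_given_Y - 1.0) / math.log2(M))

print(fano_error_lower_bound(H_X_given_Y=6.0, M=1024))   # 0.5: wrong at least half the time
```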
There are many places to find proofs of Fano's inequality, and they generally work by manipulating conditional entropy carefully. But there's a simple algorithmic way to think about what the inequality is saying that I find more intuitive.
And more useful from the perspective of lower bounds.
We can think of going from the input $X$ to the final answer $\hat{X}$ as a two-stage process: first $X$ is transformed into $Y$ (the "channel"), and then $\hat{X} = g(Y)$ is computed from $Y$ alone.
We understand the first stage quite well. Information-theoretically, we can write it as
$$ H(X) = I(X;Y) + H(X|Y)$$
In words, the information content of $X$ gets split into the part carried by $Y$ (which is $I(X;Y)$) and the remaining "uncertainty" we have about $X$ even when we're given $Y$ (which is $H(X|Y)$).
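This split is just the chain rule, and it's easy to check numerically. Here's a small Python sketch on a made-up joint distribution (any distribution would do):

```python
import math

def entropy(p):
    """Entropy (in bits) of a distribution given as a list of probabilities."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# A made-up joint distribution P[X = x, Y = y]: rows are values of X, columns are values of Y.
joint = [[0.30, 0.10],
         [0.05, 0.25],
         [0.10, 0.20]]

px = [sum(row) for row in joint]                     # marginal distribution of X
py = [sum(col) for col in zip(*joint)]               # marginal distribution of Y
H_X  = entropy(px)
H_Y  = entropy(py)
H_XY = entropy([p for row in joint for p in row])    # joint entropy H(X, Y)

I_XY = H_X + H_Y - H_XY                              # mutual information I(X; Y)
H_X_given_Y = H_XY - H_Y                             # conditional entropy H(X | Y)

print(H_X, I_XY + H_X_given_Y)                       # the two agree: H(X) = I(X;Y) + H(X|Y)
```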
We're trying to reconstruct $X$ from $Y$. And here's the crucial idea. Suppose I have a deterministic procedure $g$ that produces $\hat{X}$ such that $\Pr[\hat{X} \ne X] = p_e$. Note that the probability is being evaluated over the randomness in $X$, not in the algorithm: $g$ itself is deterministic.
Here's my encoding scheme. Given some $X = x$, I first compute $Y$. Then I check whether this is one of the values of $X$ on which $g$ makes a mistake. If so, I simply write down $x$ itself as part of my "encoding", and I signal this with a bit that says "I made a mistake". So my entire encoding is: $Y \circ \textsf{[did I make a mistake]} \circ x \ \text{(if needed)}$.
How many extra bits do I need to encode this? I need to encode the $\textsf{did I make a mistake}$ bit. This is a random variable over $\{0,1\}$ with probability $p_e$ of being $1$. Its entropy is $-p_e\log p_e - (1-p_e)\log(1-p_e) = h(p_e)$.
To encode the "if needed" part, I need (in expectation) $p_e \log |M|$ bits.
In total, my encoding added $p_e\log |M| + h(p_e)$ bits on top of $Y$. But now I can reconstruct $X$ perfectly!
This means that the extra bits had better be enough to account for the uncertainty that remains about $X$ once we have $Y$, namely $H(X|Y)$. If not, then $Y$ (which carries $I(X;Y)$ bits of information about $X$) together with the extra bits would encode $X$ exactly in fewer than $H(X)$ bits on average, contradicting Shannon's theorem.
Therefore,
$$ p_e\log |M| + h(p_e) \ge H(X|Y)$$
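Here's a small simulation of the whole argument on a toy channel (the channel and the guesser $g(Y) = Y$ are just one convenient choice, nothing canonical): sample $X$ uniformly, corrupt it with probability $\epsilon$, measure the error rate $p_e$ of the guesser, and compare the extra bits $h(p_e) + p_e \log|M|$ against the exact $H(X|Y)$.

```python
import math, random

def h(p):
    """Binary entropy h(p), in bits."""
    if p <= 0 or p >= 1:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

M, eps, trials = 16, 0.3, 200_000
random.seed(0)

# Channel: Y = X with probability 1 - eps; otherwise Y is a fresh uniform value.
# Deterministic reconstruction: g(Y) = Y.
errors = 0
for _ in range(trials):
    x = random.randrange(M)
    y = x if random.random() > eps else random.randrange(M)
    errors += (y != x)
p_e = errors / trials

extra_bits = h(p_e) + p_e * math.log2(M)   # the "did I err" bit plus the patched-in value of x

# Exact H(X|Y) for this channel: given Y, X equals Y with probability (1 - eps) + eps/M,
# and equals each of the other M - 1 values with probability eps/M.
q = (1 - eps) + eps / M
H_X_given_Y = -q * math.log2(q) - (M - 1) * (eps / M) * math.log2(eps / M)

print(f"p_e ~ {p_e:.3f}")
print(f"extra bits = {extra_bits:.3f} >= H(X|Y) = {H_X_given_Y:.3f}")
```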
Notes:
Note that $H(X|Y) = H(X) - I(X;Y)$. So for a fixed $H(X)$, high conditional entropy is the same thing as low mutual information, and vice versa.
The traditional use of Fano's inequality in the statistics world fixes the conditional entropy (or, equivalently, the mutual information) and reads off a lower bound on the error probability of any estimator:
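Concretely, assume for the moment that $X$ is uniform over $M$ (the standard minimax setup), so that $H(X) = \log|M|$. Then the weaker form above rearranges to
$$ p_e \ge \frac{H(X|Y) - 1}{\log|M|} = \frac{H(X) - I(X;Y) - 1}{\log|M|} = 1 - \frac{I(X;Y) + 1}{\log|M|} $$
So if the observations $Y$ carry only a little information about the parameter $X$, no estimator of $X$ can be accurate, no matter how clever it is.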
The minimax approach is the one you'll commonly find in explanations of Fano's inequality. But the algorithmic implication is even more interesting.
Let's flip the relationship around, and instead fix the error rate. If some procedure recovers $X$ from $Y$ with error probability at most $p_e$, then Fano rearranges to
$$ I(X;Y) \ge H(X) - h(p_e) - p_e\log|M| $$
Since $I(X;Y) \le H(Y)$, which in turn is at most the number of bits used to write down $Y$, the "channel" output — the data structure — must be nearly as large as an exact encoding of $X$. This is precisely the argument sketched at the start.
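As a toy instantiation (the particular lookup problem here is my own choice, just for illustration): suppose a data structure lets us reconstruct a uniformly random $k$-element subset of $[n]$ with failure probability at most $p_e$. Then $H(X) = \log\binom{n}{k}$, $|M| = \binom{n}{k}$, and the structure needs at least $(1-p_e)\log\binom{n}{k} - h(p_e)$ bits.

```python
import math

def h(p):
    """Binary entropy h(p), in bits."""
    if p <= 0 or p >= 1:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def subset_structure_lower_bound(n, k, p_e):
    """Fano-style lower bound (in bits) on any data structure from which a
    uniformly random k-subset of [n] can be recovered with error probability <= p_e."""
    H_X = math.log2(math.comb(n, k))       # entropy of the random subset, = log |M|
    return H_X - h(p_e) - p_e * H_X        # = (1 - p_e) * H_X - h(p_e)

# Even with a 10% failure probability allowed, the structure must be about 90%
# as large as an exact encoding of the subset:
print(subset_structure_lower_bound(n=1_000_000, k=1000, p_e=0.1))
print(math.log2(math.comb(1_000_000, 1000)))   # the exact-encoding benchmark, log C(n, k)
```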
Further thoughts: