Loading the Awesome!

geomblog

Fano's inequality is a powerful tool from information theory that is typically used to prove lower bounds for minimax estimation. But it has also been used in algorithm design to prove information-theoretic lower bounds on data structure size or even dimensionality of an embedding.

The method works (like many lower bounds do) via an algorithm design. Suppose for example you wanted to prove a lower bound on the size of a data structure for a lookup problem. You do this in two steps: firstly you demonstrate a procedure that can reconstruct a random set of inputs from a given data structure with reasonable probability. Then you invoke Fano's inequality to say that because you can do this, the mutual information of the "channel" that builds the data structure must be large, and so the number of bits needed must be large.

By Yao's minimax principle, this gives you a lower bound on randomized constructions. It's powerful also because it doesn't even care how the data structure is built: all of this is absorbed into the black-box channel.

Unfortunately, many of the standard explanations of Fano's theorem are information-theory- or statistics-focused. What I'll try to do here is a more "algorithmic" rendition of the method.

geomblog

As with most things, our story starts with pigeons. The pigeonhole principle, to be exact.

Armed with this, you can get to something that's actually useful

But let's reword the proof of this fact ever so slightly.

**Proof:** Suppose we used $b < \log n$ bits to encode an element from the set. Then there exists some $b$-bit string that encodes two elements by the PP, and we cannot decode which of the two elements was presented to us.

We can generalize this further.

Let $X$ be a random variable taking the value $i \in [n]$ with probability $p_i$. Then the average number of bits needed to encode $X$ is

$$ H(X) = \sum_i p_i \log \frac{1}{p_i} $$

- :

geomblog

The Pigeonhole Principle and Shannon's theorem are absolute statements. Either you use sufficient space or you will make mistakes.

But how many mistakes ? In other words, if I need $H(X)$ bits on average, and I use instead $b \ll H(X)$ bits, how horrible will my error be when attempting to reconstruct $X$ (expressed as a function of $b, X$) ?

This is where Fano's inequality appears. Let $h(x) = -x \log x - (1-x)\log(1-x)$.

**Fano's inequality:** Let $X$ be a random variable with range $M$. Let $\hat{X} = g(Y)$ be the predicted value of $X$ given some transmitted value $Y$, where $g$ is a deterministic function. Let $p_e$ be the probability of making a mistake: i.e $p_e = \Pr[\hat{X} \ne X]$. Then

$$ p_e \ge \frac{H(X|Y) - 1}{\log |M|} $$

or more usefully,

$$ h(p_e) + p_e \log(|M|-1) \ge H(X|Y) $$

- :

geomblog

There are many places to find proofs of Fano's inequality, and they generally work by manipulating conditional entropy carefully. But there's a simple *algorithmic* way to think about what the inequality is saying that I find more intuitive.

And more useful from the perspective of lower bounds.

We can think of the process of going from the input $X$ to the final result $\hat{X}$ as a two-stage process.

- $X$ is transformed to $Y$ via some unknown process.
- $Y$ is converted to $\hat{X}$ via the process $\hat{X} = g(Y)$.

We understand the first process quite well. Information-theoretically, we can write it as

$$ H(X) = I(X;Y) + H(X|Y)$$

In words, the information content of $X$ gets split into the part carried by $Y$ (which is $I(X;Y)$) and the remaining "uncertainty" we have about $X$ even when we're given $Y$ (which is $H(X|Y)$).

geomblog

We're trying to reconstruct $X$ from $Y$. But here's the crucial step:

In other words, if someone gives me the algorithm $g$, I can use it (and $Y$) to construct a new encoding method for $X$ that's error-free !

And here's the idea. Let's say I had a deterministic procedure $g$ that produced $\hat{X}$ such that $\Pr[\hat{X} \ne X] = p_e$. Note that the probability is being evaluated over the randomness in $X$, not in the algorithm.

Here's my encoding scheme. Given some $X = x$, I first compute $Y$. Then I inspect $g$ to see if this is one of the values of $X$ on which I make a mistake. If so, then I merely write down $x$ as my "encoding". I signal this by encoding a bit that says "I made a mistake". So my entire encoding is: $Y \circ \textsf{[did I make a mistake]} \circ x \ \text{(if needed)}$.

How many extra bits do I need to encode this ? I need to encode the $\textsf{did I make a mistake}$ bit. This is a random variable over $\{0,1\}$ with probability $p_e$ of being $1$. Its entropy is $-p_e\log p_e - (1-p_e)\log(1-p_e) = h(p_e)$.

To encode the "if needed" part, I need (in expectation) $p_e \log |M|$ bits.

geomblog

My encoding for $X$ in total added $p_e\log |M| + h(p_e)$ bits. But now I can reconstruct $X$ perfectly !

This means that the extra bits better be enough to account for the uncertainty in $X$ given $Y$ given by $H(X|Y)$. If not, then I will have an encoding of $X$ in less than $H(X)$ bits, which is a contradiction.

Therefore,

$$ p_e\log |M| + h(p_e) \ge H(X|Y)$$

**Notes:**

- You can get to $\log(|M|-1)$ by realizing that if you know you're wrong, then you know that the value encoded by $Y$ is not the right answer, and the true answer has to be drawn from the rest of the set.
- There exists a joint distribution on $X, Y$ for which this is tight. Let $X$ be such that there's probability $p_e$ of seeing $1$, and a uniform probability of seeing $[2\ldots |M|]$. Set $Y = \emptyset$ (i.e an empty channel).

geomblog

Note that $H(X|Y) = H(X) - I(X;Y)$. This means that **high** conditional entropy implies **low** mutual information, and vice versa.

The traditional use of Fano's inequality in the statistics world fixes the conditional entropy (or mutual information):

If I can show that there isn't too much mutual information between $X$ and $Y$, then Fano's inequality lower bounds the error of any reconstruction procedure, yielding a lower bound. Here, $X$ is my original set of parameters and $Y$ is what I estimate from data (say). This allows us to get minimax bounds for estimation: see Mark Reid's post for more.

geomblog

The minimax approach is the one you'll commonly find in explanations of Fano's inequality. But the algorithmic implication is even more interesting.

Let's flip the relationship around, and instead fix the error rate.

The intuition here is the following: If I can reconstruct with only a small amount of error, then the residual conditional entropy $H(X|Y)$ in the system can't be that much. Therefore, as long as we've constructed an $X$ that has sufficient entropy, the only possibility is that the mutual information between $X$ and $Y$ is large.

But suppose $Y$ is some data structure that we built from the input ? Then this immediately means that the size of $Y$ must be sufficiently large. And that's our lower bound right there.

geomblog

- Note the classic "upper bound to get a lower bound" trick here. In order to prove a
**lower bound**for the size of a data structure, we construct an algorithm that has an**upper bound**on its probability of error. - Viewed this way, using Fano's inequality to prove data structure lower bounds is "merely" a generalization of "if you want $n$ things you need $\log n$ bits

**Further thoughts:**

- It's hard to pinpoint specific applications of Fano's inequality in papers. There are numerous instances, but they tend to be hidden in plain sight. One notably clear example of this is for a non-embeddability result ! Oded Regev gave a simple proof of the inability to do dimensionality reduction for $\ell_1$ essentially by arguing that a "small space" representation of a collection of points would allow for reconstruction beyond the Fano bound.
- There's a very neat technique called "cell sampling" that layers extra tricks on top of Fano's inequality to get even stronger lower bounds. It was first used in a paper by Miltersen and Gal, and Kasper Green Larsen has used it very effectively here and here. It also appears (in disguise) in a paper by Panigrahy, Talwar and Wieder on near neighbor lower bounds, and we use the same trick in our paper on lower bounds for Bregman NN search.