*Massive Data Problems*

*A Few Streaming Algorithms*

*Data Sciences Course, Jan 2015*

Ramesh Hariharan

Streaming Algorithms

Data flying past, in a router for example

Can't afford to store the data; operate and forget (or remember what you can in the space available)

Space is at a premium: how low can we go?

Streaming Algorithms: An Example

Given a binary stream of length $n$, count the number of 1s

$\log n$ bits sufficient? necessary?

What if we want just an approximate answer, say within a factor of 2?

Create doubling buckets $1\ldots 1,\ 2 \ldots 3,\ 4 \ldots 7,\ 8 \ldots 15,\ 16\ldots 31,\ \ldots$

It suffices to know which bucket the count lies in.

$\log\log n$ bits suffice to indicate a bucket. Can we also count in this space?

Counting 1s Approximately

The Challenge: recognize when a bucket is done and we're moving to the next bucket, so we can increment

For bucket $2^i\ldots (2^{i+1}-1)$, one seemingly needs $i$ bits of count to know that the bucket is over

Use the people-counting trick instead.

Ask everyone to raise their hands; in each round, each person tosses a coin and puts their hand down on heads. Repeat for $k$ rounds until the raised hands are few enough to count directly; inflate that count by $2^k$ to get an estimate of the actual count.
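The trick is easy to simulate. A minimal sketch in Python (my code, not from the lecture; `threshold` is an arbitrary cutoff for when the raised hands become directly countable):

```python
import random

def people_count_estimate(n, threshold=16, rng=random):
    """Estimate n via halving rounds: everyone raises a hand, each round
    every raised hand drops on heads, until few enough hands remain to
    count directly; inflate the final count by 2**rounds."""
    raised = n
    rounds = 0
    while raised > threshold:
        # each raised hand stays up with probability 1/2
        raised = sum(1 for _ in range(raised) if rng.random() < 0.5)
        rounds += 1
    return raised * (2 ** rounds)

random.seed(0)
estimates = [people_count_estimate(10000) for _ in range(200)]
avg = sum(estimates) / len(estimates)
print(avg)  # close to 10000 on average
```

Each run is noisy (only ~16 hands survive), but the estimates are centered on the true count, so the average settles near it.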

Switching Buckets

Once in bucket $2^i\ldots (2^{i+1}-1)$, with every 1 that arrives, increment the count with prob $\frac{1}{2^i}$ (how exactly?)

In other words, for every 1, set $c=c+1$ with prob $\frac{1}{2^c}$

How is the number of 1s related to $c$?

Family of random variables $X_c$ corresponding to count $c$

$X_c$ is the number of 1's required, after the count has reached $c$, for the count to be incremented to $c+1$
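The increment rule $c=c+1$ with prob $\frac{1}{2^c}$ is a one-liner to simulate; a minimal sketch (names are mine, not from the lecture):

```python
import random

def morris_count(stream, rng=random):
    """Approximate count of 1s in a binary stream: on each 1, increment
    the counter c with probability 2**-c. The counter needs only
    log log n bits; 2**c - 1 estimates the number of 1s seen."""
    c = 0
    for bit in stream:
        if bit == 1 and rng.random() < 2.0 ** -c:
            c += 1
    return c

random.seed(1)
c = morris_count([1] * 1000)
print(c, 2 ** c - 1)  # c is around log2(1000) ~ 10; 2**c - 1 estimates 1000
```

A single counter is only accurate to within a constant factor with constant probability, which is exactly what the analysis that follows quantifies.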

Distribution of $X_c$

$E(X_c)$

- $=\sum_{i=1} ^ {\infty} Pr(X_c\geq i)=1+ \sum_{i=1}^{\infty} Pr(X_c>i)$
- $=1+\sum_{i=1}^{\infty} (1-\frac{1}{2^c})^i = 2^c$, as expected

$E(X_c^2) = \sum_{i=1}^{\infty} Pr(X_c^2\geq i)= \sum_{i=1}^{\infty} Pr(X_c \geq i)(i^2-(i-1)^2)$

- $=\sum_{i=1}^{\infty} (1-\frac{1}{2^c})^{i-1}(2i-1)\leq 2\cdot 2^{2c}$
- $Var(X_c) = E(X_c^2)-E(X_c)^2 \leq 2^{2c}$

Distribution of $X=\sum_{c=0}^{k-1} X_c$

Recall $X_c$: how many 1's arrive before the counter increments from $c$

$X$ is the number of 1's needed for the count to reach $k$ from 0

$E(X) = 2^{k-1}+2^{k-2}+\cdots+2^0= 2^{k}-1$

$Var(X) \leq 2^{2(k-1)}+\cdots+2^{2\cdot 0}=\frac{4^k-1}{3}$, by independence of the $X_c$

By Chebyshev's Inequality (using $2^k-1\geq 2^{k-1}$), $Pr(|X-(2^k-1)|>\epsilon (2^k-1))\leq \frac{(4^k-1)/3}{\epsilon^2 (2^k-1)^2}\leq\frac{4}{3\epsilon^2}$

Meaningful only for $\epsilon\geq \sqrt{4/3}$

Use Many Counters?

What if we use many counters and then average the count?

We need to fix the number of 1s and then study the distribution of the resulting count; let $X_n$ be the counter value after $n$ 1s

$E(2^{X_n})=E(2^{X_{n-1}})+1=n+1$ *$\rightarrow$ exercise*

$E(2^{2X_n})=E(2^{2X_{n-1}})+3n=3n(n+1)/2+1$ *$\rightarrow$ exercise*

$Var(2^{X_n})=3n(n+1)/2+1-(n+1)^2= \frac{n(n-1)}{2}$

Proof of Expectation and Variance

After $n-1$ 1s have been seen, let the prob that the counter carries value $i$ be $p_i$

After $n$ 1s, the prob that the counter carries $i$ is $p_{i-1}\frac{1}{2^{i-1}}+p_{i}(1-\frac{1}{2^i})$

$E(2^{X_n})=\sum_i 2^i [p_{i-1}\frac{1}{2^{i-1}}+p_i(1-\frac{1}{2^i})]=\sum_i 2^i p_i + \sum_i p_i = E(2^{X_{n-1}})+1$

$E(2^{2X_n})=\sum_i 2^{2i} [p_{i-1}\frac{1}{2^{i-1}}+p_i(1-\frac{1}{2^i})]=\sum_i 2^{2i} p_i + 4 \sum_i 2^{i-1} p_{i-1} - \sum_i 2^i p_i = E(2^{2X_{n-1}})+3E(2^{X_{n-1}})=E(2^{2X_{n-1}})+3n$
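Both recurrences are easy to sanity-check numerically; a quick Monte Carlo check of $E(2^{X_n})=n+1$ and $E(2^{2X_n})=3n(n+1)/2+1$ (my code, with $n=50$):

```python
import random

def final_count(n, rng):
    """Counter value X_n after n 1s, incrementing with prob 2**-c."""
    c = 0
    for _ in range(n):
        if rng.random() < 2.0 ** -c:
            c += 1
    return c

rng = random.Random(42)
n, trials = 50, 20000
samples = [2 ** final_count(n, rng) for _ in range(trials)]
mean = sum(samples) / trials                  # estimates E(2**X_n) = n + 1 = 51
mean_sq = sum(s * s for s in samples) / trials  # estimates 3n(n+1)/2 + 1 = 3826
print(mean, mean_sq)
```

With 20000 trials the empirical first moment lands within a fraction of a unit of 51; the second moment is noisier but settles near 3826.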

Using $l$ Independent Counters

Dropping the subscript $n$: $E(2^X-1)=n$, $Var(2^X-1)= \frac{n(n-1)}{2}$

Many counters $Y_1,\ldots,Y_l$, each with $E(2^{Y_i}-1)=n$, $Var(2^{Y_i}-1)= \frac{n(n-1)}{2}$

Take the average count: denote $2^Y-1=\frac{\sum_i (2^{Y_i}-1)}{l}$, so $E(2^Y-1)=n$, $Var(2^Y-1)=\frac{n(n-1)}{2l}$

By Chebyshev's Inequality, $Pr(|2^Y-1-n|>\epsilon n)\leq \frac{1}{2l\epsilon^2}$

Meaningful for $\epsilon\geq \sqrt{1/2l}$, with $l\log\log n$ bits in all
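A minimal simulation of the averaging scheme (my code; each counter implements the $c=c+1$ with prob $2^{-c}$ rule, and the average of $l$ of them has variance smaller by a factor of $l$):

```python
import random

def approx_counter(n_ones, rng):
    """One probabilistic counter fed n_ones 1s; returns estimate 2**c - 1."""
    c = 0
    for _ in range(n_ones):
        if rng.random() < 2.0 ** -c:
            c += 1
    return 2 ** c - 1

def averaged_estimate(n_ones, l, rng):
    """Average of l independent counters: same expectation n_ones,
    variance reduced from n(n-1)/2 to n(n-1)/(2l)."""
    return sum(approx_counter(n_ones, rng) for _ in range(l)) / l

rng = random.Random(7)
est = averaged_estimate(1000, 64, rng)
print(est)  # much more tightly concentrated around 1000 than a single counter
```

With $l=64$ the standard deviation drops from roughly $n/\sqrt{2}$ to roughly $n/\sqrt{128}$, matching the $\frac{1}{2l\epsilon^2}$ Chebyshev bound.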

Finding the Number of Distinct Elements

Given a stream of numbers drawn from the set ${\cal M}=\{1,2,\ldots,M\}$, find the number of distinct items in the stream.

Needs at least $M$ bits, right? For we have to determine the subset of $\cal M$ which appears in the stream.

Claim: Can do it approximately, with good probability, using $O(\log M)$ bits!

A Related Toy Problem

Let $S=\{a_1,a_2,\ldots,a_k\}$ be the distinct elements drawn from ${\cal M}=\{1,\ldots,M\}$

Suppose $S$ were a random subset of $\cal M$ of size $k$

All subsets of $\cal M$ of size $k$ are equally likely candidates for $S$

We need to estimate $k=|S|$

The Main Idea

Let $X= \min\, \{a \mid a\in S\}$

$X$ is a random variable because $S$ is a random subset of $\cal M$ of size $k$

Claim: $E(X)\approx\frac{M}{k+1}$

Claim: $X$ is (modestly) tightly distributed around this expectation

With good probability, $k$ is then close to $\frac{M}{X}-1$
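The claim $E(X)\approx\frac{M}{k+1}$ is easy to check empirically; a quick simulation (the numbers are my choices):

```python
import random

rng = random.Random(3)
M, k, trials = 100000, 99, 5000

# min of a uniformly random k-subset of {1,...,M}, averaged over many trials
avg_min = sum(min(rng.sample(range(1, M + 1), k)) for _ in range(trials)) / trials
print(avg_min)  # close to M/(k+1) = 1000
```

Intuitively, $k$ random points split $\{1,\ldots,M\}$ into $k+1$ roughly equal gaps, so the smallest point sits about $\frac{M}{k+1}$ in.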

Expectation and Variance

- $E(X)=\sum_{i=1}^M Pr(X\geq i) = \sum_{i=1}^M (\frac{M-i+1}{M})^k$
- Bounding the sum by integrals: $\frac{M}{k+1}=\int_{1}^{M+1} (\frac{M-i+1}{M})^k \,\mbox{d}i \ \leq\ E(X)\ \leq\ 1+\int_{1}^{M} (\frac{M-i+1}{M})^k \,\mbox{d}i \ \leq\ \frac{M}{k+1}+1$
- $E(X^2)=\sum_{i=1}^{M^2} Pr(X^2\geq i) = \sum_{i=1}^M Pr(X\geq i)(i^2-(i-1)^2)$
- $=\sum_{i=1}^M Pr(X\geq i)(2i-1)$
- $\leq \int_{1}^{M+1} (\frac{M-i+1}{M})^k (2i) \,\mbox{d}i +2 -E(X)= 2\frac{M^2}{(k+1)(k+2)}+\frac{M}{k+1}+2$
- $Var(X)\leq 2\frac{M^2}{(k+1)(k+2)}+\frac{M}{k+1}+2-\frac{M^2}{(k+1)^2}\leq (\frac{M}{k+1}+1)^2$
- By Chebyshev's Inequality, $Pr(|X-E(X)|>c E(X))\leq \frac{(\frac{M}{k+1}+1)^2}{c^2 (\frac{M}{k+1})^2}\leq \frac{2}{c^2}$

But $S$ is not a Random Subset

$S=\{a_1,a_2,\ldots,a_k\}$ is not a random subset of ${ \cal M}= \{1,2,\ldots,M\}$

Idea: Permute the items in $\cal M$ randomly using a bijective function $f:{\cal M}\rightarrow {\cal M}$

Then $f(S)=\{f(a_1),f(a_2),\ldots,f(a_k)\}$ is a random subset of ${ \cal M}$

Algo: Keep track of the min item after applying $f$

Challenge: a truly random $f$ needs $M\log M$ bits; way too many!

A Different Type of Permutation

Idea: Simplify $f$. Assume $M$ is a prime; if not, use a prime or a prime power nearby

Pick $\alpha,\beta$ uniformly at random from ${\cal M}=\{1,\ldots,M\}$, set $f(x)=\alpha (x-1) + \beta \ \ (\bmod M)+1$ [only $2\log M$ bits]

$\forall x,i\in {\cal M},\ Pr(f(x)=i)= \frac{1}{M}$

$\forall$ distinct $x,y$ and all $i,j\in {\cal M},\ Pr([f(x)=i] \land [f(y)=j])= \frac{1}{M^2}$

Independence stops here: pairwise, but not 3-way
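Putting the pieces together, a minimal sketch of the estimator (my code; $M$ is assumed prime, and the final median-of-runs step is a standard way to damp the constant-probability failure rather than something the slides prescribe):

```python
import random

def distinct_estimate(stream, M, rng):
    """Estimate the number of distinct items in a stream over {1,...,M}
    (M prime). Track the min of a pairwise-independent hash f and
    return M/min - 1. State: alpha, beta, and the running min, so
    O(log M) bits."""
    alpha = rng.randrange(1, M)   # avoid alpha = 0 so f stays a bijection
    beta = rng.randrange(0, M)
    smallest = M + 1
    for x in stream:
        fx = (alpha * (x - 1) + beta) % M + 1
        smallest = min(smallest, fx)
    return M / smallest - 1

rng = random.Random(11)
M = 104729  # a prime larger than the universe of stream values
stream = [rng.randrange(1, 1001) for _ in range(50000)]  # ~1000 distinct values
ests = sorted(distinct_estimate(stream, M, rng) for _ in range(15))
med = ests[7]
print(med)  # within a small constant factor of the ~1000 distinct values
```

Any single run only lands within a constant factor with constant probability, exactly as the Chebyshev analysis promises; the median of several independent runs is far more reliable.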

Performance

$Pr(X>c\frac{M}{k+1})\leq\ ?$

Let $X_i=0$ if $f(a_i)\leq c\frac{M}{k+1}$ and $X_i=1$ otherwise, so $Pr(X_i=0)=\frac{c}{k+1}$; note $X>c\frac{M}{k+1}$ exactly when every $X_i=1$, i.e., $\sum_{i=1}^k X_i=k$

$E(X_i)=1-\frac{c}{k+1}$, $Var(X_i)=(1-\frac{c}{k+1})\frac{c}{k+1}$

$E(\sum_{i=1}^k X_i)=k(1-\frac{c}{k+1})$, $Var(\sum_{i=1}^k X_i)=(1-\frac{c}{k+1})\frac{ck}{k+1}$, even with just pairwise independence

$Pr(\sum_{i=1}^k X_i\geq k)=Pr(\sum_{i=1}^k X_i-k(1-\frac{c}{k+1})\geq \frac{ck}{k+1})\leq \frac{Var(\sum_{i=1}^k X_i)}{(\frac{ck}{k+1})^2}\leq \frac{k+1}{ck}$

Performance Contd.

$Pr(X>c\frac{M}{k+1})\leq \frac{1}{c}\, \frac{k+1}{k}$

Similarly, $Pr(X<\frac{1}{c}\frac{M}{k+1})\leq \frac{1}{c}\frac{k}{k+1}$

With probability at least $1-\frac{1}{c}\frac{k}{k+1}-\frac{1}{c}\frac{k+1}{k}$, $\frac{k+1}{c}< \frac{M}{X}< c(k+1)$

Space usage: $O(\log M)$ bits — $2\log M$ for $\alpha,\beta$ plus the running min!

Finding the Second Moment

Given a stream of numbers drawn from the set $\{1,2,\ldots,M\}$ with frequencies $f_1,\ldots,f_M$, estimate the sum of squares of the frequencies of the individual items.

Naive: Keep track of each frequency; at least $M$ words required

Can one do better?

Finding the Second Moment

Idea: Track $F=(\sum_{i=1}^M r_if_i)^2$, where the $r_i$ are independent $-1/+1$ random bits, each with mean 0

Need to store only the $M$ random bits (plus one running sum), not $M$ words.

$E[F]=E[(\sum_{i=1}^M r_if_i)^2]=\sum_{i=1}^M f_i^2$, so $F$ is an unbiased estimator

$Var[F] =E[F^2]-(E[F])^2= E[(\sum_{i=1}^M r_if_i)^4]-(\sum_{i=1}^M f_i^2)^2$

- $= \sum_{i=1}^M f_i^4 + 6\sum_{i<j} f_i^2 f_j^2 - (\sum_{i=1}^M f_i^2)^2\leq 2(\sum_{i=1}^M f_i^2)^2 = 2 (E[F])^2$
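The estimator streams naturally: maintain $Z=\sum_i r_i f_i$, adding $r_x$ on each arrival of item $x$, and square at the end. A minimal sketch (my code; it draws the $r_i$ fully independently with Python's RNG — the space reduction to few-wise independence comes later):

```python
import random

def ams_estimate(stream, M, rng):
    """One AMS-style estimator of the second moment sum(f_i^2): draw an
    independent +/-1 sign r_i per item value, maintain the running sum
    Z = sum_i r_i * f_i one stream element at a time, return Z**2."""
    r = [0] + [rng.choice((-1, 1)) for _ in range(M)]  # r[1..M]
    z = 0
    for x in stream:
        z += r[x]  # item x contributes its sign r_x to Z
    return z * z

rng = random.Random(5)
M = 100
stream = [rng.randrange(1, M + 1) for _ in range(10000)]
true_f2 = sum(stream.count(i) ** 2 for i in range(1, M + 1))

# average several independent estimators to shrink the variance
est = sum(ams_estimate(stream, M, rng) for _ in range(50)) / 50
print(est, true_f2)
```

A single $F$ has variance up to $2(E[F])^2$, so it can be wildly off; the averaged version above anticipates the variance-reduction step discussed next.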

Finding the Second Moment

How do we reduce the variance?

Repeat with many independent sets of random bits and take the average

We get $F_1,\ldots,F_k$, where $F_j=(\sum_{i=1}^M r^j_if_i)^2$

Need to store $Mk$ bits now

$F=\frac{1}{k} \sum_{j=1}^k F_j$, $E[F]=\sum_{i=1}^M f_i^2$, $Var[F]\leq\frac{2}{k} (E[F])^2$

$Pr(|F-E(F)|>\epsilon E(F))\leq \frac{Var(F)}{\epsilon^2 E(F)^2}\leq \frac{2}{k\epsilon^2}$

Finding the Second Moment

How do we reduce further from $Mk$ bits?

Idea: The random bits $r_1,\ldots,r_M$ need only be 4-wise independent for $Var[F]\leq \frac{2}{k} (E[F])^2$ to hold (the variance calculation only ever multiplies at most four distinct $r_i$'s together)

Earlier we assumed $M$ is a prime and worked with the field mod $M$. Slightly messy to do this now, because we want $E[r_i]=0$

Instead, work with the field $\cal F$ of size $2^l>M$ (polynomials of degree less than $l$ with boolean coefficients, multiplied modulo an irreducible polynomial $\cal P$ of degree $l$)

Finding the Second Moment

Choose $\alpha,\beta,\gamma,\delta$ uniformly at random from $\cal F$

For each $x\in {\cal F}$, define $r_x$ as the first bit of the $l$-bit representation of $\alpha+\beta x + \gamma x^2 +\delta x^3 \in \cal F$

$\forall x\in {\cal F},\ E(r_x)=0$ (after mapping $r_x$ to $\pm 1$)

$\forall$ distinct $w,x,y,z\in{\cal F}$, $Pr(r_w \mid r_x,r_y,r_z)=Pr(r_w)$, i.e., the bits are 4-wise independent

Number of bits needed is $4lk$: with $l=O(\log M)$ and $k=O(1/\epsilon^2)$, that is $O(\frac{\log M}{\epsilon^2})$ bits
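A concrete sketch of this construction with $l=8$ (my choices, not from the lecture: the irreducible polynomial $x^8+x^4+x^3+x+1$, i.e. 0x11B, and all the names; any irreducible degree-$l$ polynomial works, provided $2^l>M$):

```python
import random

L = 8          # field GF(2^8); needs 2**L > M
IRRED = 0x11B  # x^8 + x^4 + x^3 + x + 1, irreducible over GF(2)

def gf_mul(a, b):
    """Carry-less multiplication in GF(2^L) modulo IRRED."""
    res = 0
    while b:
        if b & 1:
            res ^= a
        b >>= 1
        a <<= 1
        if a & (1 << L):  # reduce when degree reaches L
            a ^= IRRED
    return res

def make_r(rng):
    """4-wise independent +/-1 bits r_x: the top bit of the degree-3
    polynomial alpha + beta*x + gamma*x^2 + delta*x^3 over GF(2^L)."""
    a, b, c, d = (rng.randrange(1 << L) for _ in range(4))
    def r(x):
        x2 = gf_mul(x, x)
        v = a ^ gf_mul(b, x) ^ gf_mul(c, x2) ^ gf_mul(d, gf_mul(x, x2))
        return 1 if v >> (L - 1) else -1
    return r

rng = random.Random(9)
r = make_r(rng)
print([r(x) for x in range(5)])  # five +/-1 values
```

Storing only $(\alpha,\beta,\gamma,\delta)$, i.e. $4l$ bits per repetition, replaces the $M$ explicit random bits while preserving the variance bound.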