
Massive Data Problems
A Few Streaming Algorithms
Data Sciences Course, Jan 2015
Ramesh Hariharan
Streaming Algorithms
Data flying past in a router, for example
Can't afford to store the data; operate and forget (or remember what you can in the space available)
Space is at a premium: how low can we go?
Streaming Algorithms: An Example
Given a binary stream of length $n$, count the number of 1s
$\log n$ bits: sufficient? necessary?
What if we want just an approximate answer, say within a factor of 2?
Create doubling buckets $1\ldots 1,\ 2 \ldots 3,\ 4 \ldots 7,\ 8 \ldots 15,\ 16\ldots 31,\ \ldots$
Suffices to know which bucket the count lies in.
$\log\log n$ bits suffice to indicate a bucket. Can we also count in this space?
Counting 1s Approximately
The Challenge: Recognize when a bucket is done and we're moving to the next bucket, so we can increment
For bucket $2^i\ldots (2^{i+1}-1)$, doesn't one need $i$ bits of count to know that the bucket is over?
Use the people-counting trick instead.
Ask everyone to raise their hands; in each round, each person with a raised hand tosses a coin and puts their hand down on heads. Repeat for $k$ rounds until the raised hands are few enough to count directly; inflate that count by $2^k$ to estimate the actual number.
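A quick simulation sketch of this trick, assuming a crowd of known size so the estimate can be checked; `threshold` (the number of hands one can count directly) is an arbitrary choice:

```python
import random

def people_counting_estimate(n_people, threshold=30):
    """Everyone raises a hand; each round, every raised hand drops
    on a coin toss of heads. Stop when few enough hands remain to
    count directly, then inflate that count by 2**rounds."""
    raised = n_people
    rounds = 0
    while raised > threshold:
        # each raised hand stays up with probability 1/2
        raised = sum(random.random() < 0.5 for _ in range(raised))
        rounds += 1
    return raised * 2 ** rounds

# people_counting_estimate(100000) is typically within a small
# constant factor of 100000
```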
Switching Buckets
Once in bucket $2^i\ldots (2^{i+1}-1)$, with every 1 that arrives, increment the count with prob $\frac{1}{2^i}$ (how exactly?)
In other words, for every 1, set $c=c+1$ with prob $\frac{1}{2^c}$
How is the number of 1s related to $c$?
Family of random variables $X_c$ corresponding to count $c$
$X_c$ is the number of 1's required, after the count has reached $c$, for the count to be incremented to $c+1$
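This is the classic probabilistic counting rule (Morris-style). A minimal Python sketch with illustrative names; `stream` is any iterable of 0/1 bits:

```python
import random

def approx_count_ones(stream):
    """Approximate counter: keep only c (O(log log n) bits).
    On each 1, increment c with probability 2**-c."""
    c = 0
    for bit in stream:
        if bit == 1 and random.random() < 2 ** (-c):
            c += 1
    # 2**c - 1 is an unbiased estimate of the number of 1s,
    # as the following slides show
    return 2 ** c - 1
```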
Distribution of $X_c$
$E(X_c)$
- $=\sum_{i=1}^{\infty} Pr(X_c\geq i)=1+ \sum_{i=1}^{\infty} Pr(X_c>i)$
- $=1+\sum_{i=1}^{\infty} (1-\frac{1}{2^c})^i = 2^c$, as expected
$E(X_c^2) = \sum_{i=1}^{\infty} Pr(X_c^2\geq i)= \sum_{i=1}^{\infty} Pr(X_c \geq i)(i^2-(i-1)^2)$
- $=\sum_{i=1}^{\infty} (1-\frac{1}{2^c})^{i-1}(2i-1)\leq 2\cdot 2^{2c}$
- $Var(X_c) = E(X_c^2)-E(X_c)^2 \leq 2\cdot 2^{2c} - 2^{2c} = 2^{2c}$
Distribution of $X=\sum_0^{k-1} X_c$
Recall $X_c$, the number of 1's before the counter increments from $c$
$X$ is the number of 1's needed for the count to reach $k$ from 0
$E(X) = 2^{k-1}+2^{k-2}+\cdots+2^0= 2^{k}-1$
$Var(X) \leq 2^{2(k-1)}+\cdots+2^{2\cdot 0}=\frac{4^k-1}{3}\leq \frac{4}{3}(2^k-1)^2$, by independence
By Chebyshev's Ineq, $Pr(|X-(2^k-1)|>\epsilon (2^k-1))\leq \frac{4}{3\epsilon^2}$
Meaningful only for $\epsilon\geq \sqrt{4/3}$
Use Many Counters?
What if we use many counters and then average the count?
We need to fix the number of 1s and then study the distribution of the resulting count; so let $X_n$ be the (random) counter value after $n$ 1s
$E(2^{X_n})=E(2^{X_{n-1}})+1=n+1$ $\rightarrow$ exercise
$E(2^{2X_n})=E(2^{2X_{n-1}})+3n=3n(n+1)/2+1$ $\rightarrow$ exercise
$Var(2^{X_n})=3n(n+1)/2+1-(n+1)^2= \frac{n(n-1)}{2}$
Proof of Expectation and Variance
After $n-1$ 1s have been seen, let the prob that the counter carries value $i$ be $p_i$
After $n$ 1s, the prob that the counter carries $i$ is $p_{i-1}\frac{1}{2^{i-1}}+p_{i}(1-\frac{1}{2^i})$
$E(2^{X_n})=\sum_i 2^i [p_{i-1}\frac{1}{2^{i-1}}+p_i(1-\frac{1}{2^i})]=\sum_i 2^i p_i + \sum_i p_i = E(2^{X_{n-1}})+1$
$E(2^{2X_n})=\sum_i 2^{2i} [p_{i-1}\frac{1}{2^{i-1}}+p_i(1-\frac{1}{2^i})]=\sum_i 2^{2i} p_i + 4 \sum_i 2^{i-1} p_{i-1} - \sum_i 2^i p_i = E(2^{2X_{n-1}})+3E(2^{X_{n-1}})=E(2^{2X_{n-1}})+3n$
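These identities are easy to check by Monte Carlo simulation; a small sketch (the sample sizes are arbitrary):

```python
import random

def counter_after(n):
    """Counter value X_n after n 1s: increment c with prob 2**-c."""
    c = 0
    for _ in range(n):
        if random.random() < 2 ** (-c):
            c += 1
    return c

n, trials = 100, 20000
samples = [2 ** counter_after(n) for _ in range(trials)]
mean = sum(samples) / trials                    # close to n + 1
var = sum(s * s for s in samples) / trials - mean ** 2
print(mean, var)                                # var close to n*(n-1)/2
```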
Using $l$ Independent Counters
Dropping subscript $n$, $E(2^X-1)=n$, $Var(2^X-1)= \frac{n(n-1)}{2}$
Many counters $Y_1,\ldots,Y_l$, $E(2^{Y_i}-1)=n$, $Var(2^{Y_i}-1)= \frac{n(n-1)}{2}$
Take average count: denote $2^Y-1=\frac{\sum_i (2^Y_i-1)}{l}$, $E(2^Y-1)=n$, $Var(2^Y-1)=\frac{n(n-1)}{2l}$
By Chebyschev's Ineq, $Pr(|2^Y-1-n|>\epsilon n)\leq \frac{1}{2l\epsilon^2}$
Meaningful for $\epsilon\geq \sqrt{1/2l}$, with $l\log\log n$ bits
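A one-pass sketch of this averaging scheme (illustrative names; each counter uses the increment rule from earlier):

```python
import random

def averaged_estimate(bits, l=64):
    """Run l independent probabilistic counters side by side over the
    same stream and average the estimates 2**Y_i - 1; the variance
    shrinks by a factor of l."""
    counters = [0] * l
    for bit in bits:
        if bit == 1:
            for i in range(l):
                if random.random() < 2 ** (-counters[i]):
                    counters[i] += 1
    return sum(2 ** c - 1 for c in counters) / l
```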
Finding the Number of Distinct Elements
Given a stream of numbers drawn from the set ${\cal M}=\{1,2,\ldots,M\}$, find the number of distinct items in the stream.
Needs at least $M$ bits, right? For we would have to determine the subset of $\cal M$ which appears in the stream.
Claim: Can do approximately, with good probability, using $O(\log M)$ bits!
A Related Toy Problem
Let $S=\{a_1,a_2,\ldots,a_k\}$ be the distinct elements drawn from ${\cal M}=\{1,\ldots,M\}$
Suppose $S$ were a random subset of $\cal M$ of size $k$
All subsets of $\cal M$ of size $k$ are equally likely candidates for $S$
We need to estimate $k=|S|$
The Main Idea
Let $X= \min\ \{a \mid a\in S\}$
$X$ is a random variable because $S$ is a random subset of $\cal M$ of size $k$
Claim: $E(X)\approx\frac{M}{k+1}$ (it lies between $\frac{M}{k+1}$ and $\frac{M}{k+1}+1$, as shown below)
Claim: $X$ is (modestly) tightly distributed around this expectation
With high probability, $k$ is close to $\frac{M}{X}-1$
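A quick sanity-check simulation of these claims, with arbitrary choices of $M$, $k$, and trial count:

```python
import random

M, k, trials = 10**6, 1000, 2000
# draw a uniformly random k-subset of {1..M} and record its minimum
mins = [min(random.sample(range(1, M + 1), k)) for _ in range(trials)]
print(sum(mins) / trials, M / (k + 1))   # the two should be close
```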
Expectation and Variance
- $E(X)=\sum_{i=1}^M Pr(X\geq i) = \sum_{i=1}^M (\frac{M-i+1}{M})^k$
- Sandwiching the sum by integrals, $\int_{1}^{M+1} (\frac{M-i+1}{M})^k \,\mbox{d}i \leq E(X) \leq 1+\int_{1}^{M} (\frac{M-i+1}{M})^k \,\mbox{d}i$, so $\frac{M}{k+1}\leq E(X)\leq \frac{M}{k+1}+1$
- $E(X^2)=\sum_{i=1}^{M^2} Pr(X^2\geq i) = \sum_{i=1}^M Pr(X\geq i)(i^2-(i-1)^2)$
- $=\sum_{i=1}^M Pr(X\geq i)(2i-1)$
- $\leq \int_{1}^{M+1} (\frac{M-i+1}{M})^k (2i) \,\mbox{d}i +2 -E(X)= 2\frac{M^2}{(k+1)(k+2)}+\frac{M}{k+1}+2 $
- $Var(X)\leq 2\frac{M^2}{(k+1)(k+2)}+\frac{M}{k+1}+2-\frac{M^2}{(k+1)^2}\leq (\frac{M}{k+1}+1)^2$
- By Chebyshev's Ineq, $Pr(|X-E(X)|>c E(X))\leq \frac{(\frac{M}{k+1}+1)^2}{c^2 (\frac{M}{k+1})^2}\leq \frac{2}{c^2}$
But $S$ is not a Random Subset
$S=\{a_1,a_2,\ldots,a_k\}$ is not a random subset of ${\cal M}= \{1,2,\ldots,M\}$
Idea: Permute the items in ${\cal M}$ randomly using a bijective function $f:{\cal M}\rightarrow {\cal M}$
Then $f(S)=\{f(a_1),f(a_2),\ldots,f(a_k)\}$ is a random subset of ${\cal M}$
Algo: Keep track of the min item after applying $f$
Challenge: a truly random $f$ needs $M\log M$ bits; way too many!
A Different Type of Permutation
Idea: Simplify $f$. Assume $M$ is a prime; if not, use a nearby prime or prime power
Pick $\alpha,\beta$ uniformly at random from ${\cal M}=\{1,\ldots,M\}$, set $f(x)=\alpha (x-1) + \beta \ \ (\bmod M)+1$ : [Only $2\log M$ bits]
$\forall x,i\in {\cal M},\ Pr(f(x)=i)= \frac{1}{M}$
$\forall x\neq y$ and $i,j\in {\cal M}$, $Pr([f(x)=i] \land [f(y)=j])= \frac{1}{M^2}$
Independence stops here: the values $f(x)$ are only pairwise independent
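Putting the pieces together, a minimal sketch of the resulting estimator under these assumptions ($M$ prime; function and variable names are illustrative):

```python
import random

def estimate_distinct(stream, M):
    """Estimate the number of distinct items in `stream` (values in
    1..M, M assumed prime) using O(log M) bits: a pairwise-independent
    hash f(x) = alpha*(x-1) + beta mod M, plus the minimum hash seen."""
    alpha = random.randrange(1, M + 1)   # alpha, beta uniform in {1..M}
    beta = random.randrange(1, M + 1)
    x_min = M
    for a in stream:
        fa = (alpha * (a - 1) + beta) % M + 1
        x_min = min(x_min, fa)
    return M / x_min - 1                 # k is close to M/X - 1 whp
```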
Performance
$Pr(X>c\frac{M}{k+1})\leq ?$
Let $X_i=0$ if $f(a_i)\leq c\frac{M}{k+1}$ and $X_i=1$ otherwise; $Pr(X_i=0)=\frac{c}{k+1}$
$E(X_i)=1-\frac{c}{k+1}$, $Var(X_i)=(1-\frac{c}{k+1})\frac{c}{k+1}$
$E(\sum_{i=1}^k X_i)=k(1-\frac{c}{k+1})$, $Var(\sum_{i=1}^k X_i)=(1-\frac{c}{k+1})\frac{ck}{k+1}$, even with just pairwise independence
$Pr(\sum_{i=1}^k X_i\geq k)=Pr(\sum_{i=1}^k X_i-k(1-\frac{c}{k+1})\geq \frac{ck}{k+1})\leq \frac{Var(\sum_{i=1}^k X_i)}{(\frac{ck}{k+1})^2}\leq \frac{k+1}{ck} $
Performance Contd.
$Pr(X>c\frac{M}{k+1})\leq \frac{1}{c} \frac{k+1}{k}$
$Pr(X<\frac{1}{c}\frac{M}{k+1})\leq \frac{1}{c}\frac{k}{k+1}$, by a union bound over the $k$ elements
With probability $1-\frac{1}{c}\frac{k}{k+1}-\frac{1}{c}\frac{k+1}{k}$, $\frac{k+1}{c}< \frac{M}{X}< c(k+1)$
Space Usage: $2\log M$ bits!
Finding the Second Moment
Given a stream of numbers drawn from the set $\{1,2,\ldots,M\}$ with frequencies $f_1,\ldots,f_M$, estimate the sum of squares of the frequencies of the individual items.
Naive: Keep track of each frequency; at least $M$ words required
Can one do better?
Finding the Second Moment
Idea: Track $F=(\sum_{i=1}^M r_if_i)^2$, where the $r_i$ are independent $\pm 1$ random signs, each with mean 0
Need to store only $M$ bits (the random signs), not $M$ words.
$E[F]=E[(\sum_{i=1}^M r_if_i)^2]=\sum_{i=1}^M f_i^2$, so $F$ is a good estimator
$Var[F] =E[F^2]-(E[F])^2= E[(\sum_{i=1}^M r_if_i)^4]-(\sum_{i=1}^M f_i^2)^2$
- $= \sum_{i=1}^M f_i^4 + 6\sum_{i<j} f_i^2 f_j^2 - (\sum_{i=1}^M f_i^2)^2\leq 2(\sum_{i=1}^M f_i^2)^2 = 2 (E[F])^2$
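A streaming sketch of a single such estimator: maintain the signed sum $Z=\sum_i r_i f_i$ as items arrive and square it at the end (names are illustrative):

```python
import random

def ams_estimate(stream, M):
    """One AMS-style estimator of the second moment sum_i f_i**2.
    Keeps a +/-1 sign per item of {1..M} and one running signed sum."""
    sign = {x: random.choice((-1, 1)) for x in range(1, M + 1)}
    z = 0
    for a in stream:
        z += sign[a]        # Z = sum_i r_i * f_i at stream end
    return z * z            # F = Z**2, with E[F] = sum_i f_i**2
```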
Finding the Second Moment
How do we reduce the variance?
Repeat with many independent sets of random signs and take the average
We get $F_1,\ldots,F_k$, where $F_j=(\sum_{i=1}^M r^j_if_i)^2$
Need to store only $Mk$ bits now
$F=\frac{1}{k} \sum_{1}^k F_j$, $E[F]=\sum_{i=1}^M f_i^2$, $Var[F]\leq\frac{2}{k} (E[F])^2$
$Pr(|F-E(F)|>\epsilon E(F))\leq \frac{Var(F)}{\epsilon^2 E(F)^2}\leq \frac{2}{k\epsilon^2}$
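And the averaging step, run as $k$ estimators side by side in one pass (again a sketch; $k$ is a tunable accuracy parameter):

```python
import random

def ams_estimate_avg(stream, M, k=64):
    """One pass, k independent AMS estimators run side by side;
    averaging drops the variance by a factor of k."""
    signs = [{x: random.choice((-1, 1)) for x in range(1, M + 1)}
             for _ in range(k)]
    z = [0] * k
    for a in stream:
        for j in range(k):
            z[j] += signs[j][a]
    return sum(zj * zj for zj in z) / k
```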
Finding the Second Moment
How do we reduce further from $Mk$ bits?
Idea: The random signs $r_1,\ldots,r_M$ need only be 4-wise independent for $Var[F]\leq \frac{2}{k} (E[F])^2$
Earlier we assumed $M$ is a prime and worked with the field mod $M$. Slightly messy to do this now, because we want $E[r_i]=0$
Instead work with a field $\cal F$ of size $2^l\geq M$ (polynomials of degree less than $l$ with boolean coefficients, taken modulo an irreducible polynomial $\cal P$ of degree $l$)
Finding the Second Moment
Choose $\alpha,\beta,\gamma,\delta$ uniformly at random from $\cal F$
For each $x\in {\cal F}$, define $r_x$ as the first bit of the $l$-bit representation of $\alpha+\beta x + \gamma x^2 +\delta x^3 \in \cal F$
$\forall x\in {\cal F},\ E(r_x)=0$ (after mapping the bit $\{0,1\}$ to $\pm 1$)
For distinct $w,x,y,z$, $Pr(r_w=b \mid r_x,r_y,r_z)=Pr(r_w=b)$, i.e., 4-wise independence
Number of bits needed is $4lk=O(\frac{\log M}{\epsilon^2})$
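A toy sketch of such a 4-wise independent sign family over the field $GF(2^8)$ (so $l=8$, enough for $M\leq 255$); the irreducible polynomial $x^8+x^4+x^3+x+1$, the use of the low bit rather than the "first" bit, and all names are choices of this sketch:

```python
import random

IRRED = 0x11B  # x^8 + x^4 + x^3 + x + 1, irreducible over GF(2)

def gf_mul(a, b):
    """Carry-less multiplication in GF(2^8), reduced modulo IRRED."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:      # reduce whenever degree reaches 8
            a ^= IRRED
        b >>= 1
    return r

def make_fourwise_signs():
    """r_x = +/-1 from one fixed bit of the random degree-3 polynomial
    alpha + beta*x + gamma*x^2 + delta*x^3 evaluated in GF(2^8); the
    resulting signs are 4-wise independent with mean 0, and only the
    four field elements (4*l bits) need to be stored."""
    alpha, beta, gamma, delta = (random.randrange(256) for _ in range(4))
    def r(x):
        x2 = gf_mul(x, x)
        v = (alpha ^ gf_mul(beta, x) ^ gf_mul(gamma, x2)
             ^ gf_mul(delta, gf_mul(x2, x)))
        return 1 if v & 1 else -1
    return r
```

The signs from `make_fourwise_signs()` can stand in for the fully independent $r_i$ in the estimators above, reducing the space per estimator from $M$ bits to $4l$ bits.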