# Fullscreen View

The author recommends viewing the raindrop in fullsceen.
×

# Raindrop Info

Title:
Massive Data Algorithms
Authors:
Created:
2015-Jan-23
Updated:
2015-Feb-24
Views:
787
Pages:
102
Likes:
5
×

# Copy the link below to embed!

×
1

Massive Data Problems
A Few Streaming Algorithms
Data Sciences Course, Jan 2015

Ramesh Hariharan

2

Streaming Algorithms

• Data flying past in a router for example

• Can't afford to store data; operate and forget (or remember what you can in the space available)

• Space is at a premium, how low can we go?

Ramesh Hariharan

3

Streaming Algorithms: An Example

• Given a binary stream of $n$, count the number of 1s

Ramesh Hariharan

4

Streaming Algorithms: An Example

• Given a binary stream of $n$, count the number of 1s

• $\log n$ bits sufficient? necessary?

Ramesh Hariharan

5

Streaming Algorithms: An Example

• Given a binary stream of $n$, count the number of 1s

• $\log n$ bits sufficient? necessary?

• What if we want just an approximate answer? Say within a factor of 2.

Ramesh Hariharan

6

Streaming Algorithms: An Example

• Given a binary stream of $n$, count the number of 1s

• $\log n$ bits sufficient? necessary?

• What if we want just an approximate answer? Say within a factor of 2.

• Create doubling buckets $1\ldots 1,2 \ldots 3,4 \ldots 7,8 \ldots 15,16\ldots 31$

Ramesh Hariharan

7

Streaming Algorithms: An Example

• Given a binary stream of $n$, count the number of 1s

• $\log n$ bits sufficient? necessary?

• What if we want just an approximate answer? Say within a factor of 2.

• Create doubling buckets $1\ldots 1,2 \ldots 3,4 \ldots 7,8 \ldots 15,16\ldots 31$

• Suffices to know which bucket the count lies in.

Ramesh Hariharan

8

Streaming Algorithms: An Example

• Given a binary stream of $n$, count the number of 1s

• $\log n$ bits sufficient? necessary?

• What if we want just an approximate answer? Say within a factor of 2.

• Create doubling buckets $1\ldots 1,2 \ldots 3,4 \ldots 7,8 \ldots 15,16\ldots 31$

• Suffices to know which bucket the count lies in.

• $\log\log n$ bits sufficient to indicate a bucket. Can we also count in this space?

Ramesh Hariharan

9

Streaming Algorithms: An Example

• Given a binary stream of $n$, count the number of 1s

• $\log n$ bits sufficient? necessary?

• What if we want just an approximate answer? Say within a factor of 2.

• Create doubling buckets $1\ldots 1,2 \ldots 3,4 \ldots 7,8 \ldots 15,16\ldots 31$

• Suffices to know which bucket the count lies in.

• $\log\log n$ bits sufficient to indicate a bucket. Can we also count in this space?

Ramesh Hariharan

10

Counting 1s Approximately

• The Challenge: Recognize when a bucket is done and we're moving to the next bucket, so we can increment

Ramesh Hariharan

11

Counting 1s Approximately

• The Challenge: Recognize when a bucket is done and we're moving to the next bucket, so we can increment

• For bucket $2^i\ldots (2^{i+1}-1)$ one needs $i$ bits of count to know that that bucket is over?

Ramesh Hariharan

12

Counting 1s Approximately

• The Challenge: Recognize when a bucket is done and we're moving to the next bucket, so we can increment

• For bucket $2^i\ldots (2^{i+1}-1)$ one needs $i$ bits of count to know that that bucket is over?

• Use the people counting trick instead.

Ramesh Hariharan

13

Counting 1s Approximately

• The Challenge: Recognize when a bucket is done and we're moving to the next bucket, so we can increment

• For bucket $2^i\ldots (2^{i+1}-1)$ one needs $i$ bits of count to know that that bucket is over?

• Use the people counting trick instead.

• Ask everyone to raise their hands, then each person tosses a coin and, heads put their hands down, repeat $k$ rounds until raised hands can be counted. Inflated by $2^k$ to get an estimate of actual count.

Ramesh Hariharan

14

Counting 1s Approximately

• The Challenge: Recognize when a bucket is done and we're moving to the next bucket, so we can increment

• For bucket $2^i\ldots (2^{i+1}-1)$ one needs $i$ bits of count to know that that bucket is over?

• Use the people counting trick instead.

• Ask everyone to raise their hands, then each person tosses a coin and, heads put their hands down, repeat $k$ rounds until raised hands can be counted. Inflated by $2^k$ to get an estimate of actual count.

Ramesh Hariharan

15

Switching Buckets

• Once in bucket $2^i\ldots (2^{i+1}-1)$, with every 1 that arrives, increment the count with prob $\frac{1}{2^i}$ (how exactly?)

Ramesh Hariharan

16

Switching Buckets

• Once in bucket $2^i\ldots (2^{i+1}-1)$, with every 1 that arrives, increment the count with prob $\frac{1}{2^i}$ (how exactly?)

• In other words, for every 1, $c=c+1$ with prob $\frac{1}{2^c}$

Ramesh Hariharan

17

Switching Buckets

• Once in bucket $2^i\ldots (2^{i+1}-1)$, with every 1 that arrives, increment the count with prob $\frac{1}{2^i}$ (how exactly?)

• In other words, for every 1, $c=c+1$ with prob $\frac{1}{2^c}$

• How is the number of 1s related $c$?

Ramesh Hariharan

18

Switching Buckets

• Once in bucket $2^i\ldots (2^{i+1}-1)$, with every 1 that arrives, increment the count with prob $\frac{1}{2^i}$ (how exactly?)

• In other words, for every 1, $c=c+1$ with prob $\frac{1}{2^c}$

• How is the number of 1s related $c$?

• Family of random variables $X_c$ corresponding to count $c$

Ramesh Hariharan

19

Switching Buckets

• Once in bucket $2^i\ldots (2^{i+1}-1)$, with every 1 that arrives, increment the count with prob $\frac{1}{2^i}$ (how exactly?)

• In other words, for every 1, $c=c+1$ with prob $\frac{1}{2^c}$

• How is the number of 1s related $c$?

• Family of random variables $X_c$ corresponding to count $c$

• $X_c$ is the number of 1's required after the count has reached $c$ for the count to be incremented to $c+1$

Ramesh Hariharan

20

Switching Buckets

• Once in bucket $2^i\ldots (2^{i+1}-1)$, with every 1 that arrives, increment the count with prob $\frac{1}{2^i}$ (how exactly?)

• In other words, for every 1, $c=c+1$ with prob $\frac{1}{2^c}$

• How is the number of 1s related $c$?

• Family of random variables $X_c$ corresponding to count $c$

• $X_c$ is the number of 1's required after the count has reached $c$ for the count to be incremented to $c+1$

Ramesh Hariharan

21

Distribution of $X_c$

• $E(X_c)$

Ramesh Hariharan

22

Distribution of $X_c$

• $E(X_c)$
• $=\sum_{i=1} ^ {\infty} Pr(X_c\geq i)=1+ \sum_{i=1}^{\infty} Pr(X_c>i)$

Ramesh Hariharan

23

Distribution of $X_c$

• $E(X_c)$
• $=\sum_{i=1} ^ {\infty} Pr(X_c\geq i)=1+ \sum_{i=1}^{\infty} Pr(X_c>i)$
• $=1+\sum_{i=1}^{\infty} (1-\frac{1}{2^c})^i = 2^c$, as expected

Ramesh Hariharan

24

Distribution of $X_c$

• $E(X_c)$

• $=\sum_{i=1} ^ {\infty} Pr(X_c\geq i)=1+ \sum_{i=1}^{\infty} Pr(X_c>i)$
• $=1+\sum_{i=1}^{\infty} (1-\frac{1}{2^c})^i = 2^c$, as expected
• $E(X_c^2) = \sum_{i=1}^{\infty} Pr(X_c^2\geq i)= \sum_{i=1}^{\infty} Pr(X_c \geq i)(i^2-(i-1)^2)$

Ramesh Hariharan

25

Distribution of $X_c$

• $E(X_c)$

• $=\sum_{i=1} ^ {\infty} Pr(X_c\geq i)=1+ \sum_{i=1}^{\infty} Pr(X_c>i)$
• $=1+\sum_{i=1}^{\infty} (1-\frac{1}{2^c})^i = 2^c$, as expected
• $E(X_c^2) = \sum_{i=1}^{\infty} Pr(X_c^2\geq i)= \sum_{i=1}^{\infty} Pr(X_c \geq i)(i^2-(i-1)^2)$

• $\sum_{i=1}^{\infty} (1-\frac{1}{2^c})^{i-1}(2i-1)\leq 2.2^{2c}$

Ramesh Hariharan

26

Distribution of $X_c$

• $E(X_c)$

• $=\sum_{i=1} ^ {\infty} Pr(X_c\geq i)=1+ \sum_{i=1}^{\infty} Pr(X_c>i)$
• $=1+\sum_{i=1}^{\infty} (1-\frac{1}{2^c})^i = 2^c$, as expected
• $E(X_c^2) = \sum_{i=1}^{\infty} Pr(X_c^2\geq i)= \sum_{i=1}^{\infty} Pr(X_c \geq i)(i^2-(i-1)^2)$

• $\sum_{i=1}^{\infty} (1-\frac{1}{2^c})^{i-1}(2i-1)\leq 2.2^{2c}$
• $Var(X_c) \leq 2^{2c}$

Ramesh Hariharan

27

Distribution of $X_c$

• $E(X_c)$

• $=\sum_{i=1} ^ {\infty} Pr(X_c\geq i)=1+ \sum_{i=1}^{\infty} Pr(X_c>i)$
• $=1+\sum_{i=1}^{\infty} (1-\frac{1}{2^c})^i = 2^c$, as expected
• $E(X_c^2) = \sum_{i=1}^{\infty} Pr(X_c^2\geq i)= \sum_{i=1}^{\infty} Pr(X_c \geq i)(i^2-(i-1)^2)$

• $\sum_{i=1}^{\infty} (1-\frac{1}{2^c})^{i-1}(2i-1)\leq 2.2^{2c}$
• $Var(X_c) \leq 2^{2c}$

Ramesh Hariharan

28

Distribution of $X=\sum_0^{k-1} X_c$

• Recall $X_c$, how many 1's before counter increments from $c$

Ramesh Hariharan

29

Distribution of $X=\sum_0^{k-1} X_c$

• Recall $X_c$, how many 1's before counter increments from $c$

• $X$ reflects the number of 1's for the count to reach $k$ from 0

Ramesh Hariharan

30

Distribution of $X=\sum_0^{k-1} X_c$

• Recall $X_c$, how many 1's before counter increments from $c$

• $X$ reflects the number of 1's for the count to reach $k$ from 0

• $E(X) = 2^{k-1}+2^{k-2}+\cdots+2^0= 2^{k}-1$

Ramesh Hariharan

31

Distribution of $X=\sum_0^{k-1} X_c$

• Recall $X_c$, how many 1's before counter increments from $c$

• $X$ reflects the number of 1's for the count to reach $k$ from 0

• $E(X) = 2^{k-1}+2^{k-2}+\cdots+2^0= 2^{k}-1$

• $Var(X) \leq 2^{2(k-1)}+\cdots+2^{2.0}=(4^k-1)\frac{4}{3}$, by independence

Ramesh Hariharan

32

Distribution of $X=\sum_0^{k-1} X_c$

• Recall $X_c$, how many 1's before counter increments from $c$

• $X$ reflects the number of 1's for the count to reach $k$ from 0

• $E(X) = 2^{k-1}+2^{k-2}+\cdots+2^0= 2^{k}-1$

• $Var(X) \leq 2^{2(k-1)}+\cdots+2^{2.0}=(4^k-1)\frac{4}{3}$, by independence

• By Chebyschev's Ineq, $Pr(|X-(2^k-1)|>\epsilon (2^k-1))\leq \frac{4}{3\epsilon^2}$

Ramesh Hariharan

33

Distribution of $X=\sum_0^{k-1} X_c$

• Recall $X_c$, how many 1's before counter increments from $c$

• $X$ reflects the number of 1's for the count to reach $k$ from 0

• $E(X) = 2^{k-1}+2^{k-2}+\cdots+2^0= 2^{k}-1$

• $Var(X) \leq 2^{2(k-1)}+\cdots+2^{2.0}=(4^k-1)\frac{4}{3}$, by independence

• By Chebyschev's Ineq, $Pr(|X-(2^k-1)|>\epsilon (2^k-1))\leq \frac{4}{3\epsilon^2}$

• Meaningful only for $\epsilon\geq \sqrt{4/3}$

Ramesh Hariharan

34

Distribution of $X=\sum_0^{k-1} X_c$

• Recall $X_c$, how many 1's before counter increments from $c$

• $X$ reflects the number of 1's for the count to reach $k$ from 0

• $E(X) = 2^{k-1}+2^{k-2}+\cdots+2^0= 2^{k}-1$

• $Var(X) \leq 2^{2(k-1)}+\cdots+2^{2.0}=(4^k-1)\frac{4}{3}$, by independence

• By Chebyschev's Ineq, $Pr(|X-(2^k-1)|>\epsilon (2^k-1))\leq \frac{4}{3\epsilon^2}$

• Meaningful only for $\epsilon\geq \sqrt{4/3}$

Ramesh Hariharan

35

Use Many Counters?

• What if we use many counters and then average the count?

Ramesh Hariharan

36

Use Many Counters?

• What if we use many counters and then average the count?

• We need to fix the number of 1s and then study the distribution of the resulting count; let $X_n$ be the count for $n$ 1s?

Ramesh Hariharan

37

Use Many Counters?

• What if we use many counters and then average the count?

• We need to fix the number of 1s and then study the distribution of the resulting count; let $X_n$ be the count for $n$ 1s?

• $E(2^{X_n})=E(2^{X_{n-1}})+1=n+1$ $\rightarrow$ exercise

Ramesh Hariharan

38

Use Many Counters?

• What if we use many counters and then average the count?

• We need to fix the number of 1s and then study the distribution of the resulting count; let $X_n$ be the count for $n$ 1s?

• $E(2^{X_n})=E(2^{X_{n-1}})+1=n+1$ $\rightarrow$ exercise

• $E(2^{2X_n})=E(2^{2X_{n-1}})+3n=3n(n+1)/2+1$ $\rightarrow$ exercise

Ramesh Hariharan

39

Use Many Counters?

• What if we use many counters and then average the count?

• We need to fix the number of 1s and then study the distribution of the resulting count; let $X_n$ be the count for $n$ 1s?

• $E(2^{X_n})=E(2^{X_{n-1}})+1=n+1$ $\rightarrow$ exercise

• $E(2^{2X_n})=E(2^{2X_{n-1}})+3n=3n(n+1)/2+1$ $\rightarrow$ exercise

• $Var(2^{X_n})=3n(n+1)/2+1-(n+1)^2= \frac{n(n-1)}{2}$

Ramesh Hariharan

40

Use Many Counters?

• What if we use many counters and then average the count?

• We need to fix the number of 1s and then study the distribution of the resulting count; let $X_n$ be the count for $n$ 1s?

• $E(2^{X_n})=E(2^{X_{n-1}})+1=n+1$ $\rightarrow$ exercise

• $E(2^{2X_n})=E(2^{2X_{n-1}})+3n=3n(n+1)/2+1$ $\rightarrow$ exercise

• $Var(2^{X_n})=3n(n+1)/2+1-(n+1)^2= \frac{n(n-1)}{2}$

Ramesh Hariharan

41

Proof of Expectation and Variance

• After $n-1$ 1s have been seen, let the prob that the counter carries value $i$ be $p_i$

• After $n$ 1s, the prob that counter carries $i$ is $p_{i-1}\frac{1}{2^{i-1}}+p_{i}(1-\frac{1}{2^i})$

• $E(2^X_n)=\sum_i 2^i [p_{i-1}\frac{1}{2^{i-1}}+p_i(1-\frac{1}{2^i})]=\sum_i 2^i p_i + \sum p_i = E(2^{X_{n-1}})+1$

• $E(2^{2X_n})=\sum_i 2^{2i} [p_{i-1}\frac{1}{2^{i-1}}+p_i(1-\frac{1}{2^i})]=\sum_i 2^{2i} p_i + 4 \sum_i 2^{i-1} p_{i-1} - \sum 2^i p_i$ = $E(2^{2X_{n-1}})+3E(2^{X_{n-1}})=E(2^{2X_{n-1}})+3n$

Ramesh Hariharan

42

Using $l$ Independent Counters

• Dropping subscript $n$, $E(2^X-1)=n$, $Var(2^X-1)= \frac{n(n-1)}{2}$

Ramesh Hariharan

43

Using $l$ Independent Counters

• Dropping subscript $n$, $E(2^X-1)=n$, $Var(2^X-1)= \frac{n(n-1)}{2}$

• Many counters $Y_1,\ldots,Y_l$, $E(2^{Y_i}-1)=n$, $Var(2^{Y_i}-1)= \frac{n(n-1)}{2}$

Ramesh Hariharan

44

Using $l$ Independent Counters

• Dropping subscript $n$, $E(2^X-1)=n$, $Var(2^X-1)= \frac{n(n-1)}{2}$

• Many counters $Y_1,\ldots,Y_l$, $E(2^{Y_i}-1)=n$, $Var(2^{Y_i}-1)= \frac{n(n-1)}{2}$

• Take average count: denote $2^Y-1=\frac{\sum_i (2^Y_i-1)}{l}$, $E(2^Y-1)=n$, $Var(2^Y-1)=\frac{n(n-1)}{2l}$

Ramesh Hariharan

45

Using $l$ Independent Counters

• Dropping subscript $n$, $E(2^X-1)=n$, $Var(2^X-1)= \frac{n(n-1)}{2}$

• Many counters $Y_1,\ldots,Y_l$, $E(2^{Y_i}-1)=n$, $Var(2^{Y_i}-1)= \frac{n(n-1)}{2}$

• Take average count: denote $2^Y-1=\frac{\sum_i (2^Y_i-1)}{l}$, $E(2^Y-1)=n$, $Var(2^Y-1)=\frac{n(n-1)}{2l}$

• By Chebyschev's Ineq, $Pr(|2^Y-1-n|>\epsilon n)\leq \frac{1}{2l\epsilon^2}$

Ramesh Hariharan

46

Using $l$ Independent Counters

• Dropping subscript $n$, $E(2^X-1)=n$, $Var(2^X-1)= \frac{n(n-1)}{2}$

• Many counters $Y_1,\ldots,Y_l$, $E(2^{Y_i}-1)=n$, $Var(2^{Y_i}-1)= \frac{n(n-1)}{2}$

• Take average count: denote $2^Y-1=\frac{\sum_i (2^Y_i-1)}{l}$, $E(2^Y-1)=n$, $Var(2^Y-1)=\frac{n(n-1)}{2l}$

• By Chebyschev's Ineq, $Pr(|2^Y-1-n|>\epsilon n)\leq \frac{1}{2l\epsilon^2}$

• Meaningful for $\epsilon\geq \sqrt{1/2l}$, with $l\log\log n$ bits

Ramesh Hariharan

47

Using $l$ Independent Counters

• Dropping subscript $n$, $E(2^X-1)=n$, $Var(2^X-1)= \frac{n(n-1)}{2}$

• Many counters $Y_1,\ldots,Y_l$, $E(2^{Y_i}-1)=n$, $Var(2^{Y_i}-1)= \frac{n(n-1)}{2}$

• Take average count: denote $2^Y-1=\frac{\sum_i (2^Y_i-1)}{l}$, $E(2^Y-1)=n$, $Var(2^Y-1)=\frac{n(n-1)}{2l}$

• By Chebyschev's Ineq, $Pr(|2^Y-1-n|>\epsilon n)\leq \frac{1}{2l\epsilon^2}$

• Meaningful for $\epsilon\geq \sqrt{1/2l}$, with $l\log\log n$ bits

Ramesh Hariharan

48

Finding the Number of Distinct Elements

• Given a stream of numbers drawn from the set ${\cal M}=\{1,2,\ldots,M\}$, find the number of distinct items in the stream.

Ramesh Hariharan

49

Finding the Number of Distinct Elements

• Given a stream of numbers drawn from the set ${\cal M}=\{1,2,\ldots,M\}$, find the number of distinct items in the stream.

• Needs at least M bits, right? For we have to determine the subset of $\cal M$ which appears in the stream.

Ramesh Hariharan

50

Finding the Number of Distinct Elements

• Given a stream of numbers drawn from the set ${\cal M}=\{1,2,\ldots,M\}$, find the number of distinct items in the stream.

• Needs at least M bits, right? For we have to determine the subset of $\cal M$ which appears in the stream.

• Claim: Can do approximately, with good probability, using $O(\log M)$ bits!

Ramesh Hariharan

51

A Related Toy Problem

• Let $S=\{a_1,a_2,\ldots,a_k\}$ be the distinct elements drawn from ${\cal M}=\{1,\ldots,M\}$

• Suppose $S$ were a random subset of $\cal M$ of size $k$

• All subsets of $\cal M$ of size $k$ are equally likely candidates for $S$

• We need to estimate $k=|S|$

Ramesh Hariharan

52

The Main Idea

• Let $X= \min\ \{a|a\in S\}$

• $X$ is a random variable because $S$ is a random subset of $\cal M$ of size $k$

• Claim: $E(X)=\frac{M}{k+1}$.

• Claim: $X$ is (modestly) tightly distributed around this expectation

• With high probability, $k$ is close to $\frac{M}{X}-1$

Ramesh Hariharan

53

Expectation and Variance

• $E(X)=\sum_{i=1}^M Pr(X\geq i) = \sum_{i=1}^M (\frac{M-i+1}{M})^k$

Ramesh Hariharan

54

Expectation and Variance

• $E(X)=\sum_{i=1}^M Pr(X\geq i) = \sum_{i=1}^M (\frac{M-i+1}{M})^k$
• $\int_{i=1}^{M+1} (\frac{M-i+1}{M})^k \mbox{d}i \cdots 1+\int_{i=1}^{M} (\frac{M-i+1}{M})^k \mbox{d}i \subseteq \frac{M}{k+1}\cdots \frac{M}{k+1}+1$

Ramesh Hariharan

55

Expectation and Variance

• $E(X)=\sum_{i=1}^M Pr(X\geq i) = \sum_{i=1}^M (\frac{M-i+1}{M})^k$
• $\int_{i=1}^{M+1} (\frac{M-i+1}{M})^k \mbox{d}i \cdots 1+\int_{i=1}^{M} (\frac{M-i+1}{M})^k \mbox{d}i \subseteq \frac{M}{k+1}\cdots \frac{M}{k+1}+1$
• $E(X^2)=\sum_{i=1}^{M^2} Pr(X^2\geq i) = \sum_{i=1}^M Pr(X\geq i)(i^2-(i-1)^2)$

Ramesh Hariharan

56

Expectation and Variance

• $E(X)=\sum_{i=1}^M Pr(X\geq i) = \sum_{i=1}^M (\frac{M-i+1}{M})^k$
• $\int_{i=1}^{M+1} (\frac{M-i+1}{M})^k \mbox{d}i \cdots 1+\int_{i=1}^{M} (\frac{M-i+1}{M})^k \mbox{d}i \subseteq \frac{M}{k+1}\cdots \frac{M}{k+1}+1$
• $E(X^2)=\sum_{i=1}^{M^2} Pr(X^2\geq i) = \sum_{i=1}^M Pr(X\geq i)(i^2-(i-1)^2)$
• $\sum_{i=1}^M Pr(X\geq i)(2i-1)$

Ramesh Hariharan

57

Expectation and Variance

• $E(X)=\sum_{i=1}^M Pr(X\geq i) = \sum_{i=1}^M (\frac{M-i+1}{M})^k$
• $\int_{i=1}^{M+1} (\frac{M-i+1}{M})^k \mbox{d}i \cdots 1+\int_{i=1}^{M} (\frac{M-i+1}{M})^k \mbox{d}i \subseteq \frac{M}{k+1}\cdots \frac{M}{k+1}+1$
• $E(X^2)=\sum_{i=1}^{M^2} Pr(X^2\geq i) = \sum_{i=1}^M Pr(X\geq i)(i^2-(i-1)^2)$
• $\sum_{i=1}^M Pr(X\geq i)(2i-1)$
• $\leq \int_{i=1}^{M+1} (\frac{M-i+1}{M})^k (2i) \mbox{d}i +2 -E(X)= 2\frac{M^2}{(k+1)(k+2)}+\frac{M}{k+1}+2$

Ramesh Hariharan

58

Expectation and Variance

• $E(X)=\sum_{i=1}^M Pr(X\geq i) = \sum_{i=1}^M (\frac{M-i+1}{M})^k$
• $\int_{i=1}^{M+1} (\frac{M-i+1}{M})^k \mbox{d}i \cdots 1+\int_{i=1}^{M} (\frac{M-i+1}{M})^k \mbox{d}i \subseteq \frac{M}{k+1}\cdots \frac{M}{k+1}+1$
• $E(X^2)=\sum_{i=1}^{M^2} Pr(X^2\geq i) = \sum_{i=1}^M Pr(X\geq i)(i^2-(i-1)^2)$
• $\sum_{i=1}^M Pr(X\geq i)(2i-1)$
• $\leq \int_{i=1}^{M+1} (\frac{M-i+1}{M})^k (2i) \mbox{d}i +2 -E(X)= 2\frac{M^2}{(k+1)(k+2)}+\frac{M}{k+1}+2$
• $Var(X)\leq 2\frac{M^2}{(k+1)(k+2)}+\frac{M}{k+1}+2-\frac{M^2}{(k+1)^2}\leq (\frac{M}{k+1}+1)^2$

Ramesh Hariharan

59

Expectation and Variance

• $E(X)=\sum_{i=1}^M Pr(X\geq i) = \sum_{i=1}^M (\frac{M-i+1}{M})^k$
• $\int_{i=1}^{M+1} (\frac{M-i+1}{M})^k \mbox{d}i \cdots 1+\int_{i=1}^{M} (\frac{M-i+1}{M})^k \mbox{d}i \subseteq \frac{M}{k+1}\cdots \frac{M}{k+1}+1$
• $E(X^2)=\sum_{i=1}^{M^2} Pr(X^2\geq i) = \sum_{i=1}^M Pr(X\geq i)(i^2-(i-1)^2)$
• $\sum_{i=1}^M Pr(X\geq i)(2i-1)$
• $\leq \int_{i=1}^{M+1} (\frac{M-i+1}{M})^k (2i) \mbox{d}i +2 -E(X)= 2\frac{M^2}{(k+1)(k+2)}+\frac{M}{k+1}+2$
• $Var(X)\leq 2\frac{M^2}{(k+1)(k+2)}+\frac{M}{k+1}+2-\frac{M^2}{(k+1)^2}\leq (\frac{M}{k+1}+1)^2$
• By Chebyschev's Ineq, $Pr(|X-E(X)|>c E(X))\leq \frac{(\frac{M}{k+1}+1)^2}{c^2 (\frac{M}{k+1})^2}\leq \frac{2}{c^2}$

Ramesh Hariharan

60

But $S$ is not a Random Subset

• $S=\{a_1,a_2,\ldots,a_k\}$ is not a random subset of ${ \cal M}= \{1,2,\ldots,M\}$

Ramesh Hariharan

61

But $S$ is not a Random Subset

• $S=\{a_1,a_2,\ldots,a_k\}$ is not a random subset of ${ \cal M}= \{1,2,\ldots,M\}$

• Idea: Permute the items in ${ \cal M}= \{1,2,\ldots,M\}$ randomly using a bijective function $f:{\cal M}\rightarrow {\cal M}$

Ramesh Hariharan

62

But $S$ is not a Random Subset

• $S=\{a_1,a_2,\ldots,a_k\}$ is not a random subset of ${ \cal M}= \{1,2,\ldots,M\}$

• Idea: Permute the items in ${ \cal M}= \{1,2,\ldots,M\}$ randomly using a bijective function $f:{\cal M}\rightarrow {\cal M}$

• Then $f(S)=\{f(a_1),f(a_2),\ldots,f(a_k)\}$ is a random subset of ${ \cal M}$

Ramesh Hariharan

63

But $S$ is not a Random Subset

• $S=\{a_1,a_2,\ldots,a_k\}$ is not a random subset of ${ \cal M}= \{1,2,\ldots,M\}$

• Idea: Permute the items in ${ \cal M}= \{1,2,\ldots,M\}$ randomly using a bijective function $f:{\cal M}\rightarrow {\cal M}$

• Then $f(S)=\{f(a_1),f(a_2),\ldots,f(a_k)\}$ is a random subset of ${ \cal M}$

• Algo: Keep track of min item after applying $f$

Ramesh Hariharan

64

But $S$ is not a Random Subset

• $S=\{a_1,a_2,\ldots,a_k\}$ is not a random subset of ${ \cal M}= \{1,2,\ldots,M\}$

• Idea: Permute the items in ${ \cal M}= \{1,2,\ldots,M\}$ randomly using a bijective function $f:{\cal M}\rightarrow {\cal M}$

• Then $f(S)=\{f(a_1),f(a_2),\ldots,f(a_k)\}$ is a random subset of ${ \cal M}$

• Algo: Keep track of min item after applying $f$

• Challenge: $f$ needs $M\log M$ bits; way too many!

Ramesh Hariharan

65

A Different Type of Permutation

• Idea: Simplify $f$. Assume: $M$ is a prime, if not estimate a prime or a prime power nearby

Ramesh Hariharan

66

A Different Type of Permutation

• Idea: Simplify $f$. Assume: $M$ is a prime, if not estimate a prime or a prime power nearby

• Pick $\alpha,\beta$ uniformly at random from ${\cal M}=\{1,\ldots,M\}$, set $f(x)=\alpha (x-1) + \beta \ \ (\bmod M)+1$ : [Only $2\log M$ bits]

Ramesh Hariharan

67

A Different Type of Permutation

• Idea: Simplify $f$. Assume: $M$ is a prime, if not estimate a prime or a prime power nearby

• Pick $\alpha,\beta$ uniformly at random from ${\cal M}=\{1,\ldots,M\}$, set $f(x)=\alpha (x-1) + \beta \ \ (\bmod M)+1$ : [Only $2\log M$ bits]

• $\forall x,i\in {\cal M}, Pr(f(x))=i)= \frac{1}{M}$

Ramesh Hariharan

68

A Different Type of Permutation

• Idea: Simplify $f$. Assume: $M$ is a prime, if not estimate a prime or a prime power nearby

• Pick $\alpha,\beta$ uniformly at random from ${\cal M}=\{1,\ldots,M\}$, set $f(x)=\alpha (x-1) + \beta \ \ (\bmod M)+1$ : [Only $2\log M$ bits]

• $\forall x,i\in {\cal M}, Pr(f(x))=i)= \frac{1}{M}$

• $\forall x,i,y,j\in {\cal M}, Pr([f(x))=i] \land [f(y)=j])= \frac{1}{M^2}$

Ramesh Hariharan

69

A Different Type of Permutation

• Idea: Simplify $f$. Assume: $M$ is a prime, if not estimate a prime or a prime power nearby

• Pick $\alpha,\beta$ uniformly at random from ${\cal M}=\{1,\ldots,M\}$, set $f(x)=\alpha (x-1) + \beta \ \ (\bmod M)+1$ : [Only $2\log M$ bits]

• $\forall x,i\in {\cal M}, Pr(f(x))=i)= \frac{1}{M}$

• $\forall x,i,y,j\in {\cal M}, Pr([f(x))=i] \land [f(y)=j])= \frac{1}{M^2}$

• Independence stops here

Ramesh Hariharan

70

Performance

• $Pr(X>c\frac{M}{k+1})\leq ?$

Ramesh Hariharan

71

Performance

• $Pr(X>c\frac{M}{k+1})\leq ?$

• $X_i=0$ if $f(a_i)\leq c\frac{M}{k+1}$, $Pr(X_i=0)=\frac{c}{k+1}$

Ramesh Hariharan

72

Performance

• $Pr(X>c\frac{M}{k+1})\leq ?$

• $X_i=0$ if $f(a_i)\leq c\frac{M}{k+1}$, $Pr(X_i=0)=\frac{c}{k+1}$

• $E(X_i)=1-\frac{c}{k+1}, Var(X_i)=(1-\frac{c}{k+1})\frac{c}{k+1}$

Ramesh Hariharan

73

Performance

• $Pr(X>c\frac{M}{k+1})\leq ?$

• $X_i=0$ if $f(a_i)\leq c\frac{M}{k+1}$, $Pr(X_i=0)=\frac{c}{k+1}$

• $E(X_i)=1-\frac{c}{k+1}, Var(X_i)=(1-\frac{c}{k+1})\frac{c}{k+1}$

• $E(\sum_{i=1}^k X_i)=k(1-\frac{c}{k+1})$, $Var(\sum_{i=1}^k X_i)=(1-\frac{c}{k+1})\frac{ck}{k+1}$, even with just pairwise ind

Ramesh Hariharan

74

Performance

• $Pr(X>c\frac{M}{k+1})\leq ?$

• $X_i=0$ if $f(a_i)\leq c\frac{M}{k+1}$, $Pr(X_i=0)=\frac{c}{k+1}$

• $E(X_i)=1-\frac{c}{k+1}, Var(X_i)=(1-\frac{c}{k+1})\frac{c}{k+1}$

• $E(\sum_{i=1}^k X_i)=k(1-\frac{c}{k+1})$, $Var(\sum_{i=1}^k X_i)=(1-\frac{c}{k+1})\frac{ck}{k+1}$, even with just pairwise ind

• $Pr(\sum_{i=1}^k X_i\geq k)=Pr(\sum_{i=1}^k X_i-k(1-\frac{c}{k+1})\geq \frac{ck}{k+1})\leq \frac{Var(\sum_{i=1}^k X_i)}{(\frac{ck}{k+1})^2}\leq \frac{k+1}{ck}$

Ramesh Hariharan

75

Performance Contd.

• $Pr(X>c\frac{M}{k+1})\leq \frac{1}{c} \frac{k+1}{k}$

Ramesh Hariharan

76

Performance Contd.

• $Pr(X>c\frac{M}{k+1})\leq \frac{1}{c} \frac{k+1}{k}$

• $Pr(X<\frac{1}{c}\frac{M}{k+1})\leq \frac{1}{c}\frac{k}{k+1}$

Ramesh Hariharan

77

Performance Contd.

• $Pr(X>c\frac{M}{k+1})\leq \frac{1}{c} \frac{k+1}{k}$

• $Pr(X<\frac{1}{c}\frac{M}{k+1})\leq \frac{1}{c}\frac{k}{k+1}$

• With probability $1-\frac{1}{c}\frac{k}{k+1}-\frac{1}{c}\frac{k+1}{k}$, $\frac{k+1}{c}< \frac{M}{X}< c(k+1)$

Ramesh Hariharan

78

Performance Contd.

• $Pr(X>c\frac{M}{k+1})\leq \frac{1}{c} \frac{k+1}{k}$

• $Pr(X<\frac{1}{c}\frac{M}{k+1})\leq \frac{1}{c}\frac{k}{k+1}$

• With probability $1-\frac{1}{c}\frac{k}{k+1}-\frac{1}{c}\frac{k+1}{k}$, $\frac{k+1}{c}< \frac{M}{X}< c(k+1)$

• Space Usage: $2\log M$ bits!

Ramesh Hariharan

79

Finding the Second Moment

• Given a stream of numbers drawn from the set $\{1,2,\ldots,M\}$ with frequencies $f_1,\ldots,f_M$, estimate the sum of squares of the frequencies of the individual items.

Ramesh Hariharan

80

Finding the Second Moment

• Given a stream of numbers drawn from the set $\{1,2,\ldots,M\}$ with frequencies $f_1,\ldots,f_M$, estimate the sum of squares of the frequencies of the individual items.

• Naive: Keep track of each frequency, at least $M$ words required

Ramesh Hariharan

81

Finding the Second Moment

• Given a stream of numbers drawn from the set $\{1,2,\ldots,M\}$ with frequencies $f_1,\ldots,f_M$, estimate the sum of squares of the frequencies of the individual items.

• Naive: Keep track of each frequency, at least $M$ words required

• Can one do better?

Ramesh Hariharan

82

Finding the Second Moment

• Idea: Track $F=(\sum_{i=1}^M r_if_i)^2$, where $r_i$ are independent -1/1 random bits, each with mean 0

Ramesh Hariharan

83

Finding the Second Moment

• Idea: Track $F=(\sum_{i=1}^M r_if_i)^2$, where $r_i$ are independent -1/1 random bits, each with mean 0

• Need to store only $M$ bits, not $M$ words.

Ramesh Hariharan

84

Finding the Second Moment

• Idea: Track $F=(\sum_{i=1}^M r_if_i)^2$, where $r_i$ are independent -1/1 random bits, each with mean 0

• Need to store only $M$ bits, not $M$ words.

• $E[F]=E[(\sum_{i=1}^M r_if_i)^2]=\sum_{i=1}^M f_i^2$, so $F$ is a good estimator

Ramesh Hariharan

85

Finding the Second Moment

• Idea: Track $F=(\sum_{i=1}^M r_if_i)^2$, where $r_i$ are independent -1/1 random bits, each with mean 0

• Need to store only $M$ bits, not $M$ words.

• $E[F]=E[(\sum_{i=1}^M r_if_i)^2]=\sum_{i=1}^M f_i^2$, so $F$ is a good estimator

• $Var[F] =E[F^2]-(E[F])^2= E[(\sum_{i=1}^M r_if_i)^4]-(\sum_{i=1}^M f_i^2)^2$

Ramesh Hariharan

86

Finding the Second Moment

• Idea: Track $F=(\sum_{i=1}^M r_if_i)^2$, where $r_i$ are independent -1/1 random bits, each with mean 0

• Need to store only $M$ bits, not $M$ words.

• $E[F]=E[(\sum_{i=1}^M r_if_i)^2]=\sum_{i=1}^M f_i^2$, so $F$ is a good estimator

• $Var[F] =E[F^2]-(E[F])^2= E[(\sum_{i=1}^M r_if_i)^4]-(\sum_{i=1}^M f_i^2)^2$

• $= \sum_{i=1}^M f_i^4 + 6\sum_{i,j=1}^M f_i^2 f_j^2 - (\sum_{i=1}^M f_i^2)^2\leq 2(\sum_{i=1}^M f_i^2)^2 = 2 (E[F])^2$

Ramesh Hariharan

87

Finding the Second Moment

• How do we compress Variance?

Ramesh Hariharan

88

Finding the Second Moment

• How do we compress Variance?

• Repeat with many independent sets of random bits and take the average

Ramesh Hariharan

89

Finding the Second Moment

• How do we compress Variance?

• Repeat with many independent sets of random bits and take the average

• We get $F_1,\ldots,F_k$, where $F_j=(\sum_{i=1}^M r^j_if_i)^2$

Ramesh Hariharan

90

Finding the Second Moment

• How do we compress Variance?

• Repeat with many independent sets of random bits and take the average

• We get $F_1,\ldots,F_k$, where $F_j=(\sum_{i=1}^M r^j_if_i)^2$

• Need to store only $Mk$ bits now

Ramesh Hariharan

91

Finding the Second Moment

• How do we compress Variance?

• Repeat with many independent sets of random bits and take the average

• We get $F_1,\ldots,F_k$, where $F_j=(\sum_{i=1}^M r^j_if_i)^2$

• Need to store only $Mk$ bits now

• $F=\frac{1}{k} \sum_{1}^k F_j$, $E[F]=\sum_{i=1}^M f_i^2$, $Var[F]=\frac{2}{k} (E[F])^2$

Ramesh Hariharan

92

Finding the Second Moment

• How do we compress Variance?

• Repeat with many independent sets of random bits and take the average

• We get $F_1,\ldots,F_k$, where $F_j=(\sum_{i=1}^M r^j_if_i)^2$

• Need to store only $Mk$ bits now

• $F=\frac{1}{k} \sum_{1}^k F_j$, $E[F]=\sum_{i=1}^M f_i^2$, $Var[F]=\frac{2}{k} (E[F])^2$

• $Prob(|F-E(F)|>\epsilon E(F))\leq \frac{Var(F)}{\epsilon^2 E(F)^2}= \frac{2}{k\epsilon^2}$

Ramesh Hariharan

93

Finding the Second Moment

• How do we reduce further from $Mk$ bits?

Ramesh Hariharan

94

Finding the Second Moment

• How do we reduce further from $Mk$ bits?

• Idea: The random bits $r_1,\ldots,r_M$ need be only 4-way independent for $Var[F]\leq \frac{2}{k} (E[F])^2$

Ramesh Hariharan

95

Finding the Second Moment

• How do we reduce further from $Mk$ bits?

• Idea: The random bits $r_1,\ldots,r_M$ need be only 4-way independent for $Var[F]\leq \frac{2}{k} (E[F])^2$

• Earlier we assumed $M$ is a prime and worked with the field mod $M$. Slightly messy to do this now, because we want $E[r_i]=0$

Ramesh Hariharan

96

Finding the Second Moment

• How do we reduce further from $Mk$ bits?

• Idea: The random bits $r_1,\ldots,r_M$ need be only 4-way independent for $Var[F]\leq \frac{2}{k} (E[F])^2$

• Earlier we assumed $M$ is a prime and worked with the field mod $M$. Slightly messy to do this now, because we want $E[r_i]=0$

• Instead work with field $\cal F$ of size $2^l$ bigger than $M$ (polynomials of degree $l$ with boolean coefficients, taken modulo an irreducible polynomial $\cal P$ of degree $l$)

Ramesh Hariharan

97

Finding the Second Moment

• How do we reduce further from $Mk$ bits?

• Idea: The random bits $r_1,\ldots,r_M$ need be only 4-way independent for $Var[F]\leq \frac{2}{k} (E[F])^2$

• Earlier we assumed $M$ is a prime and worked with the field mod $M$. Slightly messy to do this now, because we want $E[r_i]=0$

• Instead work with field $\cal F$ of size $2^l$ bigger than $M$ (polynomials of degree $l$ with boolean coefficients, taken modulo an irreducible polynomial $\cal P$ of degree $l$)

Ramesh Hariharan

98

Finding the Second Moment

• Choose $\alpha,\beta,\gamma,\delta$ uniformly at random from $\cal F$

Ramesh Hariharan

99

Finding the Second Moment

• Choose $\alpha,\beta,\gamma,\delta$ uniformly at random from $\cal F$

• For each $x\in {\cal F}$, define $r_x$ as the first bit of the $l$-bit representation of $\alpha+\beta x + \gamma x^2 +\delta x^3 \in \cal F$

Ramesh Hariharan

100

Finding the Second Moment

• Choose $\alpha,\beta,\gamma,\delta$ uniformly at random from $\cal F$

• For each $x\in {\cal F}$, define $r_x$ as the first bit of the $l$-bit representation of $\alpha+\beta x + \gamma x^2 +\delta x^3 \in \cal F$

• $\forall x\in {\cal F}, E(r_x)=0$ (after making $r(x)$ +/-1)

Ramesh Hariharan

101

Finding the Second Moment

• Choose $\alpha,\beta,\gamma,\delta$ uniformly at random from $\cal F$

• For each $x\in {\cal F}$, define $r_x$ as the first bit of the $l$-bit representation of $\alpha+\beta x + \gamma x^2 +\delta x^3 \in \cal F$

• $\forall x\in {\cal F}, E(r_x)=0$ (after making $r(x)$ +/-1)

• $Pr(r_w|r_x,r_y,r_z)=r_w$, i.e., 4-way independence

Ramesh Hariharan

102

Finding the Second Moment

• Choose $\alpha,\beta,\gamma,\delta$ uniformly at random from $\cal F$

• For each $x\in {\cal F}$, define $r_x$ as the first bit of the $l$-bit representation of $\alpha+\beta x + \gamma x^2 +\delta x^3 \in \cal F$

• $\forall x\in {\cal F}, E(r_x)=0$ (after making $r(x)$ +/-1)

• $Pr(r_w|r_x,r_y,r_z)=r_w$, i.e., 4-way independence

• Number of bits needed is $4*l*k=4 \log M \frac{O(1)}{\epsilon^2}$

Ramesh Hariharan

Share
×