Mining Data Streams
Machine Learning
Filtering Data Stream
Problem: given a data stream, determine whether each element $a$ belongs to a set $S$.
A Bloom filter is a good fit for this problem. A Bloom filter produces no false negatives, but it may produce some false positives.
Bloom filter
Consider $|S| = m$ and $|B| = n$, where $B$ is an $n$-bit array initialized to all 0s. Let $h_i$, $i = 1, 2, \dots, k$, be hash functions mapping elements to positions in $B$.
Initialization
Hash each element $s \in S$ with every hash function $h_i$ and set $B[h_i(s)] = 1$.
Run-time
For an element $x$ in the data stream, if $B[h_i(x)] = 1$ for all $i$, declare that $x$ is in $S$; otherwise discard it.
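Below is a minimal sketch of this scheme in Python; the bit-array size, the number of hash functions, and the use of `hashlib.blake2b` with per-function salts are illustrative assumptions rather than part of these notes.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: no false negatives, some false positives."""

    def __init__(self, n_bits=1024, n_hashes=4):
        self.n = n_bits
        self.k = n_hashes
        self.bits = [0] * n_bits              # B: n-bit array filled with 0s

    def _positions(self, x):
        # Derive k hash values h_1(x), ..., h_k(x) by salting one base hash.
        for i in range(self.k):
            digest = hashlib.blake2b(f"{i}:{x}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def add(self, s):
        # Initialization: set B[h_i(s)] = 1 for every hash function h_i.
        for pos in self._positions(s):
            self.bits[pos] = 1

    def might_contain(self, x):
        # Run-time: declare x in S only if B[h_i(x)] = 1 for all i.
        return all(self.bits[pos] == 1 for pos in self._positions(x))

# Usage: S is a small set of known keys.
bf = BloomFilter()
for s in ["alice", "bob", "carol"]:
    bf.add(s)
print(bf.might_contain("alice"))    # True
print(bf.might_contain("mallory"))  # usually False; rarely a false positive
```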
Counting Distinct Elements
Problem: Maintain a count of the distinct elements seen so far.
Flajolet-Martin Approach
Pick a hash function $h$ that maps each of the $N$ elements to at least $\log_2 N$ bits. For each stream element $a$, let $r(a)$ be the number of trailing 0s in $h(a)$. E.g., if $h(a) = 12 = (1100)_2$, then $r(a) = 2$.
Let $R = \max_a r(a)$ denote the largest $r(a)$ seen so far. The estimated number of distinct elements is $2^R$.
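A minimal sketch of this estimator, assuming a 64-bit hash derived from `hashlib.blake2b` (the hash choice and bit width are illustrative):

```python
import hashlib

def trailing_zeros(x, width=64):
    """r(a): the number of trailing 0s in the binary representation of x."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1

def fm_estimate(stream):
    """Flajolet-Martin: estimate the number of distinct elements as 2^R."""
    R = 0
    for a in stream:
        h = int(hashlib.blake2b(str(a).encode(), digest_size=8).hexdigest(), 16)
        R = max(R, trailing_zeros(h))
    return 2 ** R

print(fm_estimate(["a", "b", "c", "a", "b", "d"]))  # rough estimate of 4 distinct
```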
Why Flajolet-Martin works
Let $m$ denote the number of distinct elements seen so far. A given hash value ends in at least $r$ zeros with probability $2^{-r}$, so the probability that none of the $m$ distinct elements produces a tail of $r$ zeros is $(1 - 2^{-r})^m \approx e^{-m 2^{-r}}$. Hence the probability of finding a tail of $r$ zeros is roughly

$$p(r) = \begin{cases} 1, & m \gg 2^r \\ 0, & m \ll 2^r. \end{cases}$$

Thus, $2^R$ will almost always be around $m$.
Reduce Error
$E[2^R]$ is actually infinite, which could be dangerous. A good idea is to use many hash functions, obtain many samples $2^{R_i}$, and partition them into small groups. Simply taking the average $m = E[2^{R_i}]$ might be affected by a few ridiculously large $2^{R_i}$, while the median alone is always a power of 2. A good solution is to take the median within each group and then average these medians.
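A short sketch of this combining step, following the ordering above (median within each small group, then the average of the group medians); the group size is an arbitrary illustrative choice:

```python
from statistics import median

def combine_estimates(estimates, group_size=5):
    """Combine many 2^{R_i} samples: median within each group, then average."""
    groups = [estimates[i:i + group_size]
              for i in range(0, len(estimates), group_size)]
    medians = [median(g) for g in groups]
    return sum(medians) / len(medians)
```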
Computing Moments
Suppose a stream has elements drawn from a set $A$ of $N$ values. Let $m_i$ be the number of times value $i$ occurs in the stream. The $k$-th moment is $\sum_{i \in A} (m_i)^k$.
The 0-th moment is the number of distinct elements in the stream. The 1-st moment is the length of the stream. The 2-nd moment is the surprise number $S$, which measures how uneven the distribution is.
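For example, take a stream of length 100 over 11 distinct values: counts of $90, 1, 1, \dots, 1$ give $S = 90^2 + 10 \times 1^2 = 8110$, while the nearly uniform counts $10, 9, 9, \dots, 9$ give $S = 10^2 + 10 \times 9^2 = 910$, so the more skewed stream has a much larger surprise number.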
AMS Method for 2-nd moment
Keep track of many variables $X = (X.el, X.val)$, where $X.el$ is an item $i$ and $X.val$ is a count associated with $i$. Goal: estimate $S = \sum_i m_i^2$.
Pick a random time $t$ in a stream of length $n$, and let $i$ be the item at time $t$. Set $X.el = i$ and $X.val = c$, where $c$ is the number of occurrences of $i$ in the stream from time $t$ onwards. Then an estimate of the 2-nd moment $\sum_i m_i^2$ is

$$S = f(X) = n(2c - 1).$$
We keep track of multiple variables $X_1, \dots, X_k$, and the final estimate is

$$S = \frac{1}{k} \sum_{j=1}^{k} f(X_j).$$
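A minimal single-pass sketch of this estimator in Python, assuming the stream length $n$ is known in advance so the $k$ random times can be drawn up front (the reservoir-style variant for unknown $n$ is not shown):

```python
import random

def ams_second_moment(stream, n, k=100):
    """Estimate the 2nd moment using k AMS variables X_j = (el, val)."""
    times = sorted(random.randrange(n) for _ in range(k))  # random start times t_j
    el = [None] * k    # X_j.el : item seen at the chosen time t_j
    val = [0] * k      # X_j.val: count of that item from t_j onwards
    opened = 0         # number of variables whose start time has passed

    for t, item in enumerate(stream):
        while opened < k and times[opened] == t:
            el[opened] = item            # X.el = item arriving at time t_j
            opened += 1
        for j in range(opened):
            if el[j] == item:
                val[j] += 1              # c counts occurrences from t_j onwards

    # f(X_j) = n(2c - 1); the final estimate averages over all k variables.
    return sum(n * (2 * c - 1) for c in val) / k

# Usage: the true 2nd moment of this stream is 90^2 + 5^2 + 5^2 = 8150.
stream = ["a"] * 90 + ["b"] * 5 + ["c"] * 5
random.shuffle(stream)
print(ams_second_moment(stream, n=len(stream)))
```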
Expectation analysis
Let $c_t$ denote the number of times the item arriving at time $t$ appears in the stream from time $t$ onwards. Grouping the times by item and using $\sum_{j=1}^{m_i} (2j - 1) = m_i^2$, we have:

$$E[f(X)] = \frac{1}{n} \sum_{t=1}^{n} n(2c_t - 1) = \frac{1}{n} \sum_i n \sum_{j=1}^{m_i} (2j - 1) = \sum_i m_i^2.$$
Thus, we have

$$S = E[f(X)] = \sum_i m_i^2.$$
Higher Order Moments
Replace $2c - 1$ with $c^k - (c-1)^k$. Writing $c_j$ for $X_j.val$, the general estimate is

$$S = \frac{1}{k} \sum_{j=1}^{k} n \left( c_j^k - (c_j - 1)^k \right).$$
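A hypothetical one-function sketch of this generalized estimate, where `counts` holds the $X_j.val$ values and `n` is the stream length:

```python
def kth_moment_estimate(counts, n, k):
    """AMS-style estimate of the k-th moment from the X_j.val counts c_j."""
    return sum(n * (c ** k - (c - 1) ** k) for c in counts) / len(counts)
```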
Sampling a fixed proportion
2 different problems:
- Sample a fixed proportion
- Maintain a random sample of fixed size
Solution (for sampling a fixed proportion, e.g., $1/10$ of the search queries; see the sketch below):
- Pick $1/10$ of users and take all their searches.
- Use a hash function that hashes the user id into 10 buckets, keeping the users that land in one bucket.
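A minimal sketch of the hash-based version; the `(user_id, query)` stream layout and the use of `hashlib.blake2b` are illustrative assumptions:

```python
import hashlib

def keep_user(user_id, num_buckets=10):
    """Hash the user id into num_buckets buckets; keep users in bucket 0."""
    h = int(hashlib.blake2b(user_id.encode()).hexdigest(), 16)
    return h % num_buckets == 0

def sample_queries(stream):
    """stream yields (user_id, query) pairs; keep all queries of sampled users."""
    for user_id, query in stream:
        if keep_user(user_id):
            yield user_id, query

# Usage
queries = [("u1", "laptops"), ("u2", "flights"), ("u1", "laptops"), ("u3", "shoes")]
print(list(sample_queries(queries)))  # roughly 1/10 of users, with all their queries
```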
Sampling a fixed-size sample
Reservoir Sampling
- Store the first $s$ elements of the stream in $S$.
- When the $n$-th element arrives ($n > s$), keep it with probability $s/n$.
- If we keep the $n$-th element, it replaces an element of $S$ chosen uniformly at random (see the sketch below).
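A minimal sketch of reservoir sampling in Python:

```python
import random

def reservoir_sample(stream, s):
    """Maintain a uniform random sample S of fixed size s over a stream."""
    S = []
    for n, element in enumerate(stream, start=1):
        if n <= s:
            S.append(element)                 # store the first s elements
        elif random.random() < s / n:         # keep the n-th element w.p. s/n
            S[random.randrange(s)] = element  # replace a uniform random slot
    return S

print(reservoir_sample(range(10000), s=10))
```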
Queries over a sliding window
A useful model of stream processing is that queries are about a window of length $N$: the $N$ most recent elements received. In practice, $N$ might be too large to store, or there may be too many streams to keep a window for each (for example, Amazon keeping a 0/1 stream per product recording whether it was sold in each transaction).
Bit Counting problem
Given a stream of 0s and 1s, answer queries of the form: how many 1s are in the last $k$ bits, where $k \le N$?
Naive solution: store the most recent $N$ bits. This completely breaks down when $N$ is too large to store.
A simple solution
Assume 0s and 1s are uniformly distributed. Then there are roughly $N \times \frac{s}{s+z}$ 1s in the last $N$ bits, where $s$ and $z$ are the numbers of 1s and 0s seen since the beginning of the stream. But if the stream is non-uniform, this solution obviously won't work.
DGIM
Divide the data stream into buckets (blocks), each containing a specific number (its size) of 1s. Each bit in the stream has a timestamp (index), starting 1, 2, …. We only need $O(\log_2 N)$ bits to represent a timestamp relative to the window, i.e., the timestamp modulo $N$.
A bucket consists of:
- timestamp of its end
- the number of 1s in it
Update buckets
When a new bit comes in:
- Drop the oldest bucket if its end-time is more than $N$ time units in the past (it has fallen out of the window).
- If the incoming bit is 0, nothing happens.
- If the incoming bit is 1, update the buckets.
To update the buckets:
- Create a new bucket of size 1 for the incoming bit, with the current timestamp.
- If there are now 3 buckets of size 1, combine the oldest 2 into a bucket of size 2.
- If there are now 3 buckets of size 2, combine the oldest 2 into a bucket of size 4.
- And so on.
Query
Sum the sizes of all buckets that lie entirely within the last $k$ bits, then add half the size of the oldest bucket that still overlaps the query range (see the sketch below). Remember: we don't know how many of that oldest bucket's 1s are actually within the window.
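A compact Python sketch of the DGIM update and query rules described above; buckets are stored newest-first as `(end_timestamp, size)` pairs, and at most two buckets of each size are kept (the refinement with $r$ buckets per size is not shown):

```python
class DGIM:
    """Approximate count of 1s in the last N bits of a 0/1 stream."""

    def __init__(self, N):
        self.N = N
        self.t = 0           # current timestamp
        self.buckets = []    # (end_timestamp, size) pairs, newest first

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket once its end-time falls out of the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit == 0:
            return
        # Incoming 1: create a size-1 bucket, then, while any size occurs
        # three times, combine the two oldest buckets of that size.
        self.buckets.insert(0, (self.t, 1))
        i = 0
        while i + 2 < len(self.buckets) and self.buckets[i][1] == self.buckets[i + 2][1]:
            end_ts = self.buckets[i + 1][0]         # end-time of the newer of the two
            size = self.buckets[i + 1][1] * 2
            self.buckets[i + 1:i + 3] = [(end_ts, size)]
            i += 1

    def count_ones(self, k):
        """Estimate the number of 1s among the last k bits (k <= N)."""
        total, oldest = 0, 0
        for end_ts, size in self.buckets:
            if end_ts > self.t - k:
                total += size
                oldest = size       # size of the oldest bucket still in range
            else:
                break
        return total - oldest + oldest // 2   # count only half of the oldest bucket

# Usage: a periodic stream with two 1s in every three bits.
dgim = DGIM(N=100)
for bit in [1, 1, 0] * 200:
    dgim.add(bit)
print(dgim.count_ones(60))   # roughly 40
```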
Further reducing the error
Instead of maintaining 1 or 2 buckets of each size, we allow either $r$ or $r - 1$ buckets of each size (except for buckets of the largest size). The error is then at most $O(1/r)$.