@nrailgun 2015-10-02

Mining Data Streams

Machine Learning


Filtering Data Stream

Problem: determine whether an element $a$ appearing in a data stream belongs to a set $S$.

A Bloom filter is a good fit: it produces no false negatives, but may produce some false positives.

Bloom filter

Consider $|S| = m$ and $|B| = n$, where $B$ is an $n$-bit array initialized to all 0s. Let $h_i$, $i = 1, 2, \ldots, k$, be independent hash functions.

Initialization

Hash each element $s \in S$ using each hash function $h_i$, and set $B[h_i(s)] = 1$.

Run-time

For each element $x$ in the data stream, if $B[h_i(x)] = 1$ for all $i$, declare that $x$ is in $S$; otherwise discard it.
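The initialization and run-time steps above can be sketched as follows. This is a minimal sketch, not a tuned implementation: the double-hashing scheme built on `hashlib.md5`, the class name, and the parameter choices are my own assumptions, not from the original notes.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: an n-bit array and k derived hash functions."""

    def __init__(self, n_bits, k_hashes):
        self.n = n_bits
        self.k = k_hashes
        self.bits = [0] * n_bits  # B, filled with 0

    def _indexes(self, x):
        # Derive k indexes h_1(x)..h_k(x) from one digest via double hashing.
        h = hashlib.md5(str(x).encode()).digest()
        a = int.from_bytes(h[:8], "big")
        b = int.from_bytes(h[8:], "big") | 1  # keep the step odd so indexes differ
        return [(a + i * b) % self.n for i in range(self.k)]

    def add(self, x):
        # Initialization: set B[h_i(s)] = 1 for each hash function.
        for i in self._indexes(x):
            self.bits[i] = 1

    def might_contain(self, x):
        # Run-time: True means "possibly in S" (false positives allowed);
        # False means "definitely not in S" (no false negatives).
        return all(self.bits[i] for i in self._indexes(x))

bf = BloomFilter(n_bits=1024, k_hashes=3)
for s in ["alice", "bob", "carol"]:
    bf.add(s)
print(bf.might_contain("alice"))    # True (no false negatives)
print(bf.might_contain("mallory"))  # False unless a rare false positive occurs
```

Note the one-sided error: every inserted element always tests positive; only non-members can be misclassified.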


Counting Distinct Elements

Problem: Maintain a count of the distinct elements seen so far.

Flajolet-Martin Approach

Pick a hash function $h$ that maps each of $N$ elements to at least $\log_2 N$ bits. For each stream element $a$, let $r(a)$ be the number of trailing 0s in $h(a)$. E.g., if $h(a) = 12 = (1100)_2$, then $r(a) = 2$.

Let $R = \max_a r(a)$ denote the largest $r(a)$ seen so far. The estimated number of distinct elements is $2^R$.
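The estimator can be sketched as follows. The choice of `hashlib.md5` (truncated to 64 bits) as the hash function is my own assumption; any well-mixing hash works.

```python
import hashlib

def trailing_zeros(x):
    # r(a): number of trailing 0 bits in x (with r defined as 0 for x == 0).
    if x == 0:
        return 0
    r = 0
    while x & 1 == 0:
        x >>= 1
        r += 1
    return r

def fm_estimate(stream):
    """Flajolet-Martin: estimate the distinct count as 2^R, R = max r(h(a))."""
    R = 0
    for a in stream:
        h = int.from_bytes(hashlib.md5(str(a).encode()).digest()[:8], "big")
        R = max(R, trailing_zeros(h))
    return 2 ** R

# Duplicates hash identically, so they never change R -- only distinct
# elements can push the estimate up.
print(fm_estimate(["a", "b", "c", "a", "b"]))
```

The estimate is always a power of 2, which is exactly why a single $R$ is coarse and the error-reduction tricks below are needed.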

Why Flajolet-Martin works

Let $m$ denote the number of distinct elements seen so far, and $r$ a number of trailing zeros. The probability that some hash value ends in a tail of at least $r$ zeros is

$$p(r) = 1 - (1 - 2^{-r})^m \approx \begin{cases} 1 & \text{if } m \gg 2^r, \\ 0 & \text{if } m \ll 2^r. \end{cases}$$

Thus $2^R$ will almost always be around $m$.

Reduce Error

$E[2^R]$ is actually infinite, which could be dangerous. Partitioning samples into groups is a good idea, but a plain average $m = E[2^{R_i}]$ might be affected by a few ridiculously large $2^{R_i}$. A good solution is to take the median of the $2^{R_i}$ within each group and then average those medians.
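The grouping trick can be sketched as follows; the helper name and the example numbers are my own, chosen so that one absurdly large estimate is visibly suppressed.

```python
from statistics import mean, median

def average_of_medians(estimates, group_size):
    # Split the 2^{R_i} estimates into groups, take each group's median
    # (robust to a few huge outliers), then average the medians
    # (so the result is not stuck on a power of 2).
    groups = [estimates[i:i + group_size]
              for i in range(0, len(estimates), group_size)]
    return mean(median(g) for g in groups)

# The 1024 outlier is absorbed by its group's median.
print(average_of_medians([4, 8, 4, 1024, 2, 4, 8, 4, 2], 3))  # → 4
```

Medians alone would always return a power of 2; averages alone would be dragged up by outliers. Composing them fixes both problems.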


Computing Moments

Suppose a stream has elements drawn from a set $A$ of $N$ values. Let $m_i$ be the number of times value $i$ occurs in the stream. The $k$-th moment is $\sum_{i \in A} (m_i)^k$.

The 0-th moment is the number of distinct elements in the stream. The 1-st moment is the count of elements in the stream. The 2-nd moment is the surprise number $S$, which measures how uneven the distribution is.

AMS Method for 2-nd moment

Keep track of many variables $X = \{el, val\}$, where $X.el$ is an item $i$ and $X.val$ is a count for $i$. Goal: estimate $S = \sum_i m_i^2$.

Pick a random time $t$, set $X.el = i$ (the item arriving at time $t$), and $X.val = c$, where $c$ is the number of occurrences of $i$ in the stream from time $t$ onwards. Then one estimate of the 2-nd moment $\sum_i m_i^2$ is

$$S = f(X) = n(2c - 1).$$

We keep track of multiple $X$s, and the final estimate is

$$S = \frac{1}{k} \sum_{j=1}^{k} f(X_j).$$
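The method can be sketched as follows over a stored stream (a real deployment would count forward online from the sampled time instead of slicing; the function name is my own):

```python
import random

def ams_second_moment(stream, k):
    """AMS sketch: average k estimates n*(2c - 1), unbiased for sum_i m_i^2."""
    n = len(stream)
    estimates = []
    for _ in range(k):
        t = random.randrange(n)        # pick a random time t
        item = stream[t]               # X.el = i, the item at time t
        c = stream[t:].count(item)     # X.val = c, occurrences of i from t onwards
        estimates.append(n * (2 * c - 1))
    return sum(estimates) / k

# Stream with m_1 = 3, m_2 = 2, so the true second moment is 9 + 4 = 13.
stream = [1, 1, 1, 2, 2]
print(ams_second_moment(stream, 20))
```

Each single estimate here lies in $\{5, 15, 25\}$, so any average of them stays within that range; the expectation analysis below shows the average centers on 13.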

Expectation analysis

Let $c_t$ denote the number of times the item at time $t$ appears in the stream from time $t$ onwards. We have:

$$E[f(X)] = \frac{1}{n} \sum_{t=1}^{n} n(2c_t - 1) = \frac{1}{n} \sum_i n \sum_{j=1}^{m_i} (2j - 1) = \sum_i m_i^2.$$

Thus, we have $S = E[f(X)] = \sum_i m_i^2$.

Higher Order Moments

Replace $2c - 1$ with $c^k - (c - 1)^k$. The general estimate is

$$S = \frac{1}{k} \sum_{j=1}^{k} n \left( c_j^k - (c_j - 1)^k \right).$$
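A quick check of the telescoping identity behind this generalization (the helper name and the example stream are my own; averaging over every possible sample time is exact because $\sum_{j=1}^{m_i} \left( j^k - (j-1)^k \right)$ telescopes to $m_i^k$):

```python
def ams_kth_estimate(n, c, k):
    # One AMS-style estimate of the k-th moment from a single sample
    # whose item occurs c times from the sampled time onwards.
    return n * (c ** k - (c - 1) ** k)

# Stream with m_1 = 3, m_2 = 2; true 3rd moment is 3^3 + 2^3 = 35.
stream = [1, 1, 1, 2, 2]
n = len(stream)
counts = [stream[t:].count(stream[t]) for t in range(n)]
print(sum(ams_kth_estimate(n, c, 3) for c in counts) / n)  # → 35.0
```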


Sampling a fixed proportion

2 different problems:

Solution:


Sampling a fixed-size sample

Reservoir Sampling

  1. Store the first $s$ elements of the stream in $S$;
  2. When the $n$-th element arrives ($n > s$), keep it with probability $s/n$;
  3. If we keep the $n$-th element, it replaces an element of $S$ chosen uniformly at random.
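The three steps above can be sketched as follows (a minimal sketch; the function name is my own):

```python
import random

def reservoir_sample(stream, s):
    """Maintain a uniform random sample of size s from a stream of unknown length."""
    sample = []
    for n, x in enumerate(stream, start=1):
        if n <= s:
            sample.append(x)                 # 1. store the first s elements
        elif random.random() < s / n:        # 2. keep the n-th element w.p. s/n
            sample[random.randrange(s)] = x  # 3. evict a slot uniformly at random
    return sample

print(reservoir_sample(range(1000), 10))
```

By induction, after any number of arrivals every element seen so far is in the sample with equal probability $s/n$, which is the point of the algorithm.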

Queries over a sliding window

A useful model of stream processing is that queries are about a window of length $N$: the $N$ most recent elements received. In practice, $N$ might be too large to store (for example, at Amazon's scale).

Bit Counting problem

Given a stream of 0s and 1s, answer queries of the form: how many 1s are in the last $k$ bits, where $k \le N$?

Naive solution: store the most recent $N$ bits. This breaks down completely when $N$ is too large to store.

A simple solution

Assuming 0s and 1s are uniformly distributed, there are $N \times \frac{s}{s+z}$ 1s in the last $N$ bits, where $s$ and $z$ are the numbers of 1s and 0s seen from the beginning of the stream. But if the stream is non-uniform, this solution obviously won't work.

DGIM

Divide the data stream into buckets (blocks), each containing a specific number (size) of 1s. Each bit in the stream has a timestamp (index), starting from 1, 2, $\ldots$. We need $O(\log_2 N)$ bits to represent a timestamp relative to the window, i.e., the timestamp modulo $N$.

A bucket consists of:

  - the timestamp of its end (stored modulo $N$);
  - the number of 1s in it (the bucket size, always a power of 2).

Update buckets

When a new bit comes in:

  1. Drop the oldest bucket if its end time is more than $N$ time units in the past.
  2. If incoming bit is 0, nothing will happen.
  3. If incoming bit is 1, update buckets.

To update buckets:

  1. Create bucket with size 1.
  2. If there are now 3 buckets of size 1, combine the oldest 2 into a bucket of size 2.
  3. If there are 3 buckets of size 2, combine the oldest 2 into a bucket of size 4.
  4. And so on....

Query

Sum the sizes of all buckets except the oldest, then add half the size of the oldest bucket. Remember: we don't know how many of the oldest bucket's 1s are still inside the window.
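The update and query rules can be sketched as follows. This is a minimal sketch using absolute timestamps rather than timestamps modulo $N$, and the `(end_time, size)` bucket representation is my own choice.

```python
def dgim_update(buckets, t, bit, N):
    """One DGIM step. `buckets` is a list of (end_time, size), newest first;
    sizes are powers of 2, with at most two buckets of each size."""
    # 1. Drop the oldest bucket once its end time falls out of the window.
    if buckets and buckets[-1][0] <= t - N:
        buckets.pop()
    # 2. An incoming 0 changes nothing.  3. A 1 starts a new size-1 bucket.
    if bit == 1:
        buckets.insert(0, (t, 1))
        # Cascade: whenever three buckets share a size, merge the oldest two
        # into one of double the size, keeping the newer end time.
        i = 0
        while i + 2 < len(buckets):
            if buckets[i][1] == buckets[i + 1][1] == buckets[i + 2][1]:
                newer, older = buckets[i + 1], buckets[i + 2]
                buckets[i + 1] = (newer[0], newer[1] + older[1])
                del buckets[i + 2]
            else:
                i += 1
    return buckets

def dgim_count(buckets):
    # Sum all bucket sizes, counting only half of the oldest bucket.
    if not buckets:
        return 0
    return sum(size for _, size in buckets[:-1]) + buckets[-1][1] // 2

buckets = []
for t in range(20):                 # a stream of twenty 1s, window N = 10
    dgim_update(buckets, t, 1, N=10)
print(dgim_count(buckets))          # close to the true count of 10
```

The only error comes from the half-counted oldest bucket, which is what bounds the estimate within 50% of the true count.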

Further reducing the error

Instead of allowing 1 or 2 buckets of each size, allow $r$ or $r-1$ buckets of each size (except for the largest size). The error is then at most $O(1/r)$.
