Mining Data Streams
Machine Learning
Filtering Data Stream
Problem: given a data stream, determine whether each element $a$ belongs to a set $S$.
A Bloom filter is a good fit for this problem. A Bloom filter produces no false negatives, but it may produce some false positives.
Bloom filter
Consider $|S| = m$ and $|B| = n$, where $B$ is an $n$-bit array initialized to all 0s. Let $h_i$, $i = 1, 2, \dots, k$, be hash functions mapping elements to positions in $B$.
Initialization
Hash each element $s \in S$ with every hash function $h_i$ and set $B[h_i(s)] = 1$.
Run-time
For an element $x$ in the data stream, if $B[h_i(x)] = 1$ for all $i$, declare that $x$ is in $S$; otherwise discard it.
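Below is a minimal sketch of this scheme in Python; the bit-array size, the number of hash functions, and the use of `hashlib.blake2b` with per-function salts are illustrative assumptions rather than part of these notes.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: no false negatives, some false positives."""

    def __init__(self, n_bits=1024, n_hashes=4):
        self.n = n_bits
        self.k = n_hashes
        self.bits = [0] * n_bits              # B: n-bit array filled with 0s

    def _positions(self, x):
        # Derive k hash values h_1(x), ..., h_k(x) by salting one base hash.
        for i in range(self.k):
            digest = hashlib.blake2b(f"{i}:{x}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def add(self, s):
        # Initialization: set B[h_i(s)] = 1 for every hash function h_i.
        for pos in self._positions(s):
            self.bits[pos] = 1

    def might_contain(self, x):
        # Run-time: declare x in S only if B[h_i(x)] = 1 for all i.
        return all(self.bits[pos] == 1 for pos in self._positions(x))

# Usage: S is a small set of known keys.
bf = BloomFilter()
for s in ["alice", "bob", "carol"]:
    bf.add(s)
print(bf.might_contain("alice"))    # True
print(bf.might_contain("mallory"))  # usually False; rarely a false positive
```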
Counting Distinct Elements
Problem: Maintain a count of the distinct elements seen so far.
Flajolet-Martin Approach
Pick a hash function $h$ that maps each of the $N$ elements to at least $\log_2 N$ bits. For each stream element $a$, let $r(a)$ be the number of trailing 0s in $h(a)$. E.g., if $h(a) = 12 = (1100)_2$, then $r(a) = 2$.
Let $R = \max_a r(a)$ denote the largest $r(a)$ seen so far. The estimated number of distinct elements is $2^R$.
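A minimal sketch of this estimator, assuming a 64-bit hash derived from `hashlib.blake2b` (the hash choice and bit width are illustrative):

```python
import hashlib

def trailing_zeros(x, width=64):
    """r(a): the number of trailing 0s in the binary representation of x."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1

def fm_estimate(stream):
    """Flajolet-Martin: estimate the number of distinct elements as 2^R."""
    R = 0
    for a in stream:
        h = int(hashlib.blake2b(str(a).encode(), digest_size=8).hexdigest(), 16)
        R = max(R, trailing_zeros(h))
    return 2 ** R

print(fm_estimate(["a", "b", "c", "a", "b", "d"]))  # rough estimate of 4 distinct
```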
Why Flajolet-Martin works
Let $m$ denote the number of distinct elements seen so far. A given hash value ends in at least $r$ zeros with probability $2^{-r}$, so the probability that none of the $m$ distinct elements produces a tail of $r$ zeros is $(1 - 2^{-r})^m \approx e^{-m 2^{-r}}$. Hence the probability of finding a tail of $r$ zeros is roughly

$$p(r) = \begin{cases} 1, & m \gg 2^r \\ 0, & m \ll 2^r. \end{cases}$$

Thus, $2^R$ will almost always be around $m$.
Reduce Error
$E[2^R]$ is actually infinite, which could be dangerous. A good idea is to use many hash functions, obtain many samples $2^{R_i}$, and partition them into small groups. Simply taking the average $m = E[2^{R_i}]$ might be affected by a few ridiculously large $2^{R_i}$, while the median alone is always a power of 2. A good solution is to take the median within each group and then average these medians.
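A short sketch of this combining step, following the ordering above (median within each small group, then the average of the group medians); the group size is an arbitrary illustrative choice:

```python
from statistics import median

def combine_estimates(estimates, group_size=5):
    """Combine many 2^{R_i} samples: median within each group, then average."""
    groups = [estimates[i:i + group_size]
              for i in range(0, len(estimates), group_size)]
    medians = [median(g) for g in groups]
    return sum(medians) / len(medians)
```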
Computing Moments
Suppose a stream has elements drawn from a set $A$ of $N$ values. Let $m_i$ be the number of times value $i$ occurs in the stream. The $k$-th moment is $\sum_{i \in A} (m_i)^k$.
The 0-th moment is the number of distinct elements in the stream. The 1-st moment is the length of the stream. The 2-nd moment is the surprise number $S$, which measures how uneven the distribution is.
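For example, take a stream of length 100 over 11 distinct values: counts of $90, 1, 1, \dots, 1$ give $S = 90^2 + 10 \times 1^2 = 8110$, while the nearly uniform counts $10, 9, 9, \dots, 9$ give $S = 10^2 + 10 \times 9^2 = 910$, so the more skewed stream has a much larger surprise number.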
AMS Method for 2-nd moment
Keep track of many variables $X = (X.el, X.val)$, where $X.el$ is an item $i$ and $X.val$ is a count associated with $i$. Goal: estimate $S = \sum_i m_i^2$.
Pick a random time $t$ in a stream of length $n$, and let $i$ be the item at time $t$. Set $X.el = i$ and $X.val = c$, where $c$ is the number of occurrences of $i$ in the stream from time $t$ onwards. Then an estimate of the 2-nd moment $\sum_i m_i^2$ is

$$S = f(X) = n(2c - 1).$$
We keep track of multiple variables $X_1, \dots, X_k$, and the final estimate is

$$S = \frac{1}{k} \sum_{j=1}^{k} f(X_j).$$
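A minimal single-pass sketch of this estimator in Python, assuming the stream length $n$ is known in advance so the $k$ random times can be drawn up front (the reservoir-style variant for unknown $n$ is not shown):

```python
import random

def ams_second_moment(stream, n, k=100):
    """Estimate the 2nd moment using k AMS variables X_j = (el, val)."""
    times = sorted(random.randrange(n) for _ in range(k))  # random start times t_j
    el = [None] * k    # X_j.el : item seen at the chosen time t_j
    val = [0] * k      # X_j.val: count of that item from t_j onwards
    opened = 0         # number of variables whose start time has passed

    for t, item in enumerate(stream):
        while opened < k and times[opened] == t:
            el[opened] = item            # X.el = item arriving at time t_j
            opened += 1
        for j in range(opened):
            if el[j] == item:
                val[j] += 1              # c counts occurrences from t_j onwards

    # f(X_j) = n(2c - 1); the final estimate averages over all k variables.
    return sum(n * (2 * c - 1) for c in val) / k

# Usage: the true 2nd moment of this stream is 90^2 + 5^2 + 5^2 = 8150.
stream = ["a"] * 90 + ["b"] * 5 + ["c"] * 5
random.shuffle(stream)
print(ams_second_moment(stream, n=len(stream)))
```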
Expectation analysis
Let $c_t$ denote the number of times the item arriving at time $t$ appears in the stream from time $t$ onwards. Grouping the times by item and using $\sum_{j=1}^{m_i} (2j - 1) = m_i^2$, we have:

$$E[f(X)] = \frac{1}{n} \sum_{t=1}^{n} n(2c_t - 1) = \frac{1}{n} \sum_i n \sum_{j=1}^{m_i} (2j - 1) = \sum_i m_i^2.$$
Thus, we have

$$S = E[f(X)] = \sum_i m_i^2.$$
Higher Order Moments
Replace $2c - 1$ with $c^k - (c-1)^k$. Writing $c_j$ for $X_j.val$, the general estimate is

$$S = \frac{1}{k} \sum_{j=1}^{k} n \left( c_j^k - (c_j - 1)^k \right).$$
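A hypothetical one-function sketch of this generalized estimate, where `counts` holds the $X_j.val$ values and `n` is the stream length:

```python
def kth_moment_estimate(counts, n, k):
    """AMS-style estimate of the k-th moment from the X_j.val counts c_j."""
    return sum(n * (c ** k - (c - 1) ** k) for c in counts) / len(counts)
```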
Sampling a fixed proportion
2 different problems:
- Sample a fixed proportion
- Maintain a random sample of fixed size
Solution (for sampling a fixed proportion, e.g., $1/10$ of the search queries; see the sketch below):
- Pick $1/10$ of users and take all their searches.
- Use a hash function that hashes the user id into 10 buckets, keeping the users that land in one bucket.
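A minimal sketch of the hash-based version; the `(user_id, query)` stream layout and the use of `hashlib.blake2b` are illustrative assumptions:

```python
import hashlib

def keep_user(user_id, num_buckets=10):
    """Hash the user id into num_buckets buckets; keep users in bucket 0."""
    h = int(hashlib.blake2b(user_id.encode()).hexdigest(), 16)
    return h % num_buckets == 0

def sample_queries(stream):
    """stream yields (user_id, query) pairs; keep all queries of sampled users."""
    for user_id, query in stream:
        if keep_user(user_id):
            yield user_id, query

# Usage
queries = [("u1", "laptops"), ("u2", "flights"), ("u1", "laptops"), ("u3", "shoes")]
print(list(sample_queries(queries)))  # roughly 1/10 of users, with all their queries
```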
Sampling a fixed-size sample
Reservoir Sampling
- Store the first $s$ elements of the stream in $S$.
- When the $n$-th element arrives ($n > s$), keep it with probability $s/n$.
- If we keep the $n$-th element, it replaces an element of $S$ chosen uniformly at random (see the sketch below).
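A minimal sketch of reservoir sampling in Python:

```python
import random

def reservoir_sample(stream, s):
    """Maintain a uniform random sample S of fixed size s over a stream."""
    S = []
    for n, element in enumerate(stream, start=1):
        if n <= s:
            S.append(element)                 # store the first s elements
        elif random.random() < s / n:         # keep the n-th element w.p. s/n
            S[random.randrange(s)] = element  # replace a uniform random slot
    return S

print(reservoir_sample(range(10000), s=10))
```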
Queries over a sliding window
A useful model of stream processing is that queries are about a window of length $N$: the $N$ most recent elements received. In practice, $N$ might be too large to store, or there may be too many streams to keep a window for each (for example, Amazon keeping a 0/1 stream per product recording whether it was sold in each transaction).
Bit Counting problem
Given a stream of 0s and 1s, answer queries of the form: how many 1s are in the last $k$ bits, where $k \le N$?
Naive solution: store the most recent $N$ bits. This completely breaks down when $N$ is too large to store.
A simple solution
Assume 0s and 1s are uniformly distributed. Then there are roughly $N \times \frac{s}{s+z}$ 1s in the last $N$ bits, where $s$ and $z$ are the numbers of 1s and 0s seen since the beginning of the stream. But if the stream is non-uniform, this solution obviously won't work.
DGIM
Divide the data stream into buckets (blocks), each containing a specific number (its size) of 1s. Each bit in the stream has a timestamp (index), starting 1, 2, …. We only need $O(\log_2 N)$ bits to represent a timestamp relative to the window, i.e., the timestamp modulo $N$.
A bucket consists of:
- timestamp of its end
- the number of 1s in it
Update buckets
When a new bit comes in:
- Drop the oldest bucket if its end-time is more than $N$ time units in the past (it has fallen out of the window).
- If the incoming bit is 0, nothing happens.
- If the incoming bit is 1, update the buckets.
To update the buckets:
- Create a new bucket of size 1 for the incoming bit, with the current timestamp.
- If there are now 3 buckets of size 1, combine the oldest 2 into a bucket of size 2.
- If there are now 3 buckets of size 2, combine the oldest 2 into a bucket of size 4.
- And so on.
Query
Sum the sizes of all buckets that lie entirely within the last $k$ bits, then add half the size of the oldest bucket that still overlaps the query range (see the sketch below). Remember: we don't know how many of that oldest bucket's 1s are actually within the window.
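A compact Python sketch of the DGIM update and query rules described above; buckets are stored newest-first as `(end_timestamp, size)` pairs, and at most two buckets of each size are kept (the refinement with $r$ buckets per size is not shown):

```python
class DGIM:
    """Approximate count of 1s in the last N bits of a 0/1 stream."""

    def __init__(self, N):
        self.N = N
        self.t = 0           # current timestamp
        self.buckets = []    # (end_timestamp, size) pairs, newest first

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket once its end-time falls out of the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit == 0:
            return
        # Incoming 1: create a size-1 bucket, then, while any size occurs
        # three times, combine the two oldest buckets of that size.
        self.buckets.insert(0, (self.t, 1))
        i = 0
        while i + 2 < len(self.buckets) and self.buckets[i][1] == self.buckets[i + 2][1]:
            end_ts = self.buckets[i + 1][0]         # end-time of the newer of the two
            size = self.buckets[i + 1][1] * 2
            self.buckets[i + 1:i + 3] = [(end_ts, size)]
            i += 1

    def count_ones(self, k):
        """Estimate the number of 1s among the last k bits (k <= N)."""
        total, oldest = 0, 0
        for end_ts, size in self.buckets:
            if end_ts > self.t - k:
                total += size
                oldest = size       # size of the oldest bucket still in range
            else:
                break
        return total - oldest + oldest // 2   # count only half of the oldest bucket

# Usage: a periodic stream with two 1s in every three bits.
dgim = DGIM(N=100)
for bit in [1, 1, 0] * 200:
    dgim.add(bit)
print(dgim.count_ones(60))   # roughly 40
```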
Further reducing the error
Instead of maintaining 1 or 2 buckets of each size, we allow either $r$ or $r - 1$ buckets of each size (except for buckets of the largest size). The error is then at most $O(1/r)$.