This could be a good idea today. We define Jarccard distance sim(C1,C2) as:
sim(C1,C2)=|C1∩C2||C1∪C2|,
where C1 and C2 are sets of dimensions.
Finding Similar Documents
Goal: Given a large documents set, find near duplicated pairs.
Applications:
Similar news cluster
Similar mirror website cluster
Problems:
Too many pairs to compare.
Documents usually too large to fit in memory.
3 Essential Steps for Similar Doc's
Step
Function
Shingling
Convert documents to dimension sets.
Min-Hashing
Convert large sets to short signature, preserving similarity.
Locality-Sensitive Hashing
Focus on pairs of signatures likely to be from similar documents.
Big Picture:
Shingling
A k-shingle (or k-gram) is a sequence of k tokens that appears in a document. Tokens can be words, charaters, or whatever you need. Assuming tokens to be characters in this note. For example, let k=2, and document D1=abcab. Set of k-shingles S(D1)={ab,bc,ca}.
To compress long shingles, We can hash them to (say) 4 bytes. Another nice effect of comressing is that now it's faster to compare between shingles. For example, you can hash S(D1)={ab,bc,ca} to h(D1)={1,5,7}.
Equivalently, each document is a 0/1 vector in the space of k-shingles, where each unique k-shingle is a dimension.
Thumb rule: k=5 is good for short documents, while k=10 is better for long documents.
Min-Hashing
Suppose we need to find near duplicated documents among N documents. Naively computing Jaccard similarity pairwise needs N(N−1) comparisons. This is too slow.
Let C denotes a K×N boolean matrix, where N is the number of documents, and let Ci denotes the i-th column of C. Each column Ci, a boolean vector, represents corresponding k-shingle.
We need a hashing algorithm h(Ci) which preserves similarity, such that:
h(Ci)=h(Cj), if sim(Ci,Cj) is high;
h(Ci)≠h(Cj), if sim(Ci,Cj) is low.
Define Min-Hashing function hπ(Ci) as:
hπ(Ci)=minπ(C)
where π is a random permutation vector. I guess you might consider π(C) as boolean indexing in MATLAB. Applying hπ to C produces 1×N row vector. We apply Kπ to C and produce K×N matrix M.
Locality-Sensitive Hashing
The goal of LSH is to find documents with Jaccard similarity at least s (say, s=0.8). Say, the columns x and y of M are a candidate pair if M(i,x)=M(i,y) for at least s of value i.
We divide M into b bands of r rows. Candidate pairs are those columns that hash to the same bucket for at least 1 bands.
The probability C1 and C2 are identical in one band is sr, where s=sim(C1,C2). The probability C1 and C2 are not identical in all b bands is (1−sr)b. Picking a larger r gives less false positive (more sound), while picking a larger b gives less false negative (more complete).