@nrailgun 2015-09-16T10:56:45.000000Z Words: 2861 Views: 1874

Finding Similar Items

Machine Learning

Distance Metric

Goal

Find near neighbors in high dimensional space.

Jaccard similarity

Jaccard similarity is a good fit for this task. We define the Jaccard similarity $sim(C_1, C_2)$ as:

$sim(C_1, C_2) = \frac{ | C_1 \cap C_2 | }{ | C_1 \cup C_2 | } ,$
where $C_1$ and $C_2$ are sets of dimensions.
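
As a quick illustration of the definition above, here is a minimal Python sketch. The function name `jaccard_similarity` and the choice to treat two empty sets as identical are assumptions made for this example, not part of the original note.

```python
def jaccard_similarity(c1: set, c2: set) -> float:
    """Return |c1 ∩ c2| / |c1 ∪ c2| for two sets."""
    union = c1 | c2
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(c1 & c2) / len(union)


c1 = {"ab", "bc", "ca"}
c2 = {"ab", "bc", "cd"}
# 2 shared elements out of 4 distinct elements
print(jaccard_similarity(c1, c2))  # 0.5
```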

Finding Similar Documents

Goal: Given a large set of documents, find near-duplicate pairs.
Applications:
- Clustering similar news articles
- Clustering mirror websites
Problems:
- Too many pairs to compare.
- Documents are usually too large to fit in memory.

3 Essential Steps for Finding Similar Documents

Step	Function
Shingling	Convert documents to dimension sets.
Min-Hashing	Convert large sets to short signatures, preserving similarity.
Locality-Sensitive Hashing	Focus on pairs of signatures likely to be from similar documents.

Big Picture: documents → Shingling → Min-Hashing → Locality-Sensitive Hashing → candidate pairs.

Shingling

A k-shingle (or k-gram) is a sequence of $k$ tokens that appears in a document. Tokens can be words, characters, or whatever you need. This note assumes tokens are characters. For example, let $k = 2$ and document $D_1 = \mathrm{abcab}$ . The set of k-shingles is $S(D_1) = \{ \mathrm{ab}, \mathrm{bc}, \mathrm{ca} \}$ .
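
The example above can be reproduced with a short Python sketch. The helper name `k_shingles` is hypothetical, and it assumes character tokens, as this note does.

```python
def k_shingles(document: str, k: int) -> set:
    """Return the set of character k-shingles that appear in `document`."""
    # Slide a window of length k over the document; the set drops repeats.
    return {document[i:i + k] for i in range(len(document) - k + 1)}


print(sorted(k_shingles("abcab", 2)))  # ['ab', 'bc', 'ca']
```

Note that the repeated shingle `ab` is stored only once, because the result is a set.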

To compress long shingles, we can hash them to (say) 4 bytes. Another nice effect of compressing is that shingles become faster to compare. For example, you can hash $S(D_1) = \{ \mathrm{ab}, \mathrm{bc}, \mathrm{ca} \}$ to $h(D_1) = \{ 1, 5, 7 \}$ .
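
One possible way to get 4-byte hashes is sketched below. Using CRC32 from Python's standard `zlib` module is an assumption made for illustration; the note does not say which hash function to use, and the small values $\{1, 5, 7\}$ above are only illustrative. The function names are hypothetical.

```python
import zlib


def hash_shingle(shingle: str) -> int:
    """Map a shingle to an unsigned 32-bit (4-byte) integer via CRC32."""
    return zlib.crc32(shingle.encode("utf-8"))


def hashed_shingles(document: str, k: int) -> set:
    """Return the set of 32-bit hashes of the character k-shingles."""
    return {
        hash_shingle(document[i:i + k])
        for i in range(len(document) - k + 1)
    }


hashed = hashed_shingles("abcab", 2)
print(hashed)  # three integers, each in the range [0, 2**32)
```

Comparing two integers is cheaper than comparing two long strings, which is where the speed-up mentioned above comes from.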

Equivalently, each document is a $0/1$ vector in the space of k-shingles, where each unique k-shingle is a dimension.

Rule of thumb: $k = 5$ is good for short documents, while $k = 10$ is better for long documents.

Min-Hashing

Suppose we need to find near-duplicate documents among $N$ documents. Naively computing the Jaccard similarity of every pair needs $N(N-1)/2$ comparisons, which is too slow when $N$ is large.
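
To get a feel for the cost, here is a small worked example, counting each unordered pair once. The value $N = 10^6$ is an assumed collection size chosen only for illustration.

$$\binom{N}{2} = \frac{N(N-1)}{2}, \qquad N = 10^6 \;\Rightarrow\; \frac{10^6 \cdot (10^6 - 1)}{2} = 499\,999\,500\,000 \approx 5 \times 10^{11} \text{ pairs}.$$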

Let $C$ denote a boolean matrix with one row per distinct k-shingle and one column per document, so it has $N$ columns, where $N$ is the number of documents. Let $C_i$ denote the $i$ -th column of $C$ . Each column $C_i$ , a boolean vector, represents the set of k-shingles of the corresponding document.

We need a hashing algorithm $h(C_i)$ which preserves similarity, such that:

$h(C_i) = h(C_j)$ with high probability, if $sim(C_i, C_j)$ is high;
$h(C_i) \not= h(C_j)$ with high probability, if $sim(C_i, C_j)$ is low.

Define Min-Hashing function $h_\pi(C_i)$ as:

$h_\pi(C_i) = \min \pi(C_i)$
where $\pi$ is a random permutation vector over the rows of $C$ . You might think of $\pi(C_i)$ as boolean indexing in MATLAB: it picks the entries of $\pi$ at the rows where $C_i$ is 1. The useful property is that $\Pr[h_\pi(C_i) = h_\pi(C_j)] = sim(C_i, C_j)$ . Applying $h_\pi$ to $C$ produces a $1 \times N$ row vector. We apply $K$ random permutations $\pi$ to $C$ and produce a $K \times N$ signature matrix $M$ .
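
The construction of $M$ can be sketched in Python as follows. This is a minimal illustration under stated assumptions: each column is given as a non-empty set of row indices where it is 1, explicit random permutations are used (practical for small inputs only), and the names `minhash_signatures` and `agreement` are hypothetical.

```python
import random


def minhash_signatures(columns, num_rows, num_perms, seed=0):
    """Build a num_perms x N signature matrix M as a list of rows.

    columns[j] is the non-empty set of row indices where column j is 1.
    """
    rng = random.Random(seed)
    signature = []
    for _ in range(num_perms):
        # pi[r] is the position of row r under a random permutation.
        pi = list(range(num_rows))
        rng.shuffle(pi)
        # h_pi(C_j): the smallest permuted position among the rows set in C_j.
        signature.append([min(pi[r] for r in col) for col in columns])
    return signature


def agreement(M, x, y):
    """Fraction of rows of M on which columns x and y agree."""
    return sum(row[x] == row[y] for row in M) / len(M)


columns = [{0, 1, 2}, {0, 1, 3}, {4, 5}]
M = minhash_signatures(columns, num_rows=6, num_perms=200)
print(agreement(M, 0, 1))  # an estimate of sim = 2/4 = 0.5
print(agreement(M, 0, 2))  # 0.0, since disjoint columns never share a minimum
```

The fraction of rows of $M$ on which two columns agree serves as an estimate of their Jaccard similarity, and the estimate tightens as the number of permutations grows.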

Locality-Sensitive Hashing

The goal of LSH is to find documents with Jaccard similarity at least $s$ (say, $s = 0.8$ ). The columns $x$ and $y$ of $M$ are a candidate pair if $M(i, x) = M(i, y)$ for at least a fraction $s$ of the rows $i$ .

We divide $M$ into $b$ bands of $r$ rows each. Within each band, every column's $r$ values are hashed to a bucket. Candidate pairs are those columns that hash to the same bucket in at least one band.
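
The banding step can be sketched in Python as below. This is an illustrative sketch, not a tuned implementation: it assumes $M$ is a list of rows, uses the tuple of a band's $r$ values directly as the bucket key (so two columns share a bucket exactly when they are identical in that band), and the name `lsh_candidate_pairs` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(M, b, r):
    """Return column pairs of M that share a bucket in at least one band."""
    assert len(M) == b * r, "signature length must equal b * r"
    num_cols = len(M[0])
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for col in range(num_cols):
            # The r values of this column inside the band form the bucket key.
            key = tuple(M[row][col] for row in range(band * r, (band + 1) * r))
            buckets[key].append(col)
        # Every pair of columns in the same bucket becomes a candidate pair.
        for cols in buckets.values():
            candidates.update(combinations(cols, 2))
    return candidates


M = [
    [1, 1, 4],
    [2, 2, 5],
    [3, 0, 3],
    [0, 7, 6],
]
# Columns 0 and 1 are identical in the first band (rows 0-1) only.
print(lsh_candidate_pairs(M, b=2, r=2))  # {(0, 1)}
```

Only the candidate pairs need a full similarity check afterwards, which avoids comparing every pair of columns.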

The probability that $C_1$ and $C_2$ are identical in one particular band is $s^r$ , where $s = sim(C_1, C_2)$ . The probability that $C_1$ and $C_2$ are identical in none of the $b$ bands is $(1 - s^r)^b$ , so they become a candidate pair with probability $1 - (1 - s^r)^b$ . Picking a larger $r$ gives fewer false positives (more sound), while picking a larger $b$ gives fewer false negatives (more complete).
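
To see how $r$ and $b$ shape this trade-off, the probability that a pair becomes a candidate, $1 - (1 - s^r)^b$ , can be tabulated with a few lines of Python. The values $r = 5$ and $b = 20$ are assumptions chosen only for illustration, and the function name is hypothetical.

```python
def candidate_probability(s: float, r: int, b: int) -> float:
    """Probability that two columns with similarity s agree in at least one band."""
    return 1 - (1 - s ** r) ** b


# Assumed parameters: b = 20 bands of r = 5 rows each.
for s in (0.2, 0.4, 0.6, 0.8):
    print(s, round(candidate_probability(s, r=5, b=20), 4))
```

With these assumed parameters the probability stays below 0.01 at $s = 0.2$ and exceeds 0.999 at $s = 0.8$ , which is the S-shaped behavior that makes banding useful.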