k-NN
Machine Learning
Algorithm
Given a training set $T=\{(x_i,y_i)\mid x_i\in\mathcal{X}\subseteq\mathbf{R}^n,\ y_i\in\mathcal{Y}=\{c_1,c_2,\dots,c_K\}\}$, where $i=1,2,\dots,N$, and a distance metric.
Let $N_k(x)$ denote the set of the $k$ nearest neighbors of $x$ (aka its $k$-NN) under the given distance metric.
Decide the class $y$ of $x$ by majority vote over the points in $N_k(x)$:

$$y=\arg\max_{c_j}\sum_{x_i\in N_k(x)} I(y_i=c_j),$$

where $i=1,2,\dots,N$, $j=1,2,\dots,K$, and $I$ denotes the indicator function.
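A brute-force sketch of this decision rule in Python, assuming Euclidean distance as the metric; the name `knn_predict` and its arguments are illustrative, not from any library:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    """Classify x by majority vote among its k nearest training points."""
    # Euclidean distance from x to every training point x_i.
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest points, i.e. N_k(x).
    nearest = np.argsort(dists)[:k]
    # Count I(y_i = c_j) for each class and take the argmax.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

For example, with `X_train = np.array([[0, 0], [1, 1], [5, 5]])` and `y_train = np.array([0, 0, 1])`, `knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3)` predicts class 0, since two of the three neighbors vote for it.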
Model
Selecting a smaller $k$ reduces the approximation error, but the prediction becomes more sensitive to nearby noise points, i.e. the estimation error increases. At the other extreme, $k=N$ always predicts the overall majority class regardless of the input, which is useless. In practice $k$ is usually taken to be a fairly small value and tuned, e.g. by cross-validation.
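As a sketch only: one way to tune $k$ is cross-validated accuracy of the brute-force `knn_predict` above (the candidate values, fold count, and the name `select_k` are arbitrary choices, not a prescribed recipe):

```python
import numpy as np

def select_k(X, y, candidates=(1, 3, 5, 7, 9), n_folds=5, seed=0):
    """Pick k by cross-validated accuracy of knn_predict (defined above)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    best_k, best_acc = candidates[0], -1.0
    for k in candidates:
        fold_accs = []
        for f in range(n_folds):
            val = folds[f]
            train = np.concatenate([folds[g] for g in range(n_folds) if g != f])
            # Predict every held-out point using only the remaining folds.
            preds = np.array([knn_predict(X[train], y[train], x, k) for x in X[val]])
            fold_accs.append(np.mean(preds == y[val]))
        if np.mean(fold_accs) > best_acc:
            best_k, best_acc = k, float(np.mean(fold_accs))
    return best_k
```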
kd tree
Linear scanning works, but it is too slow when $N$ is large; a kd tree speeds up the search. Given $T=\{x_1,x_2,\dots,x_N\}$, where $x_i=(x_i^{(1)},x_i^{(2)},\dots,x_i^{(k)})^{\mathrm T}$, $i=1,2,\dots,N$, build the tree as follows (a code sketch follows the list):
- For a node at depth $j$, choose the splitting dimension $l=(j \bmod k)+1$ and partition the $x_i$ in the current region into 2 subregions at the median of their $x_i^{(l)}$ values.
- Put those $x_i$ whose $x_i^{(l)}$ is less than the median into the left subtree, the others into the right subtree. Recurse on each subregion until it contains at most one $x_i$.
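A minimal construction sketch under these two steps, assuming 0-based depths so the splitting axis is `depth % k`; `KDNode` and `build_kd_tree` are illustrative names, and the median point itself is stored at the node:

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class KDNode:
    point: np.ndarray                    # the x_i stored at this node (the median)
    axis: int                            # splitting dimension l (0-based)
    left: Optional["KDNode"] = None      # points with x^(l) below the median
    right: Optional["KDNode"] = None     # the remaining points

def build_kd_tree(points, depth=0):
    """Recursively split the points on the median along axis depth % k."""
    if len(points) == 0:
        return None
    k = len(points[0])                   # dimensionality of the x_i
    axis = depth % k                     # cycle through the coordinates
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2                  # index of the median along this axis
    return KDNode(
        point=np.asarray(pts[mid]),
        axis=axis,
        left=build_kd_tree(pts[:mid], depth + 1),
        right=build_kd_tree(pts[mid + 1:], depth + 1),
    )
```

Sorting at every level makes construction $O(N\log^2 N)$; selecting the median in linear time instead would reduce it to $O(N\log N)$.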
kd tree search
TODO: The algorithm is not hard to understand, but I don't know how to implement it. I'll deal with it later.
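For reference until then, a minimal sketch of the standard backtracking search, assuming the `KDNode` tree built above: descend toward the query point, then unwind the recursion, visiting the far subtree only when the splitting hyperplane lies closer than the current best distance. This handles the single nearest neighbor only; returning $k$ neighbors would need a bounded priority queue of the best candidates.

```python
import numpy as np

def nearest_neighbor(node, target, best=None):
    """Return (point, distance) of the nearest stored point to target."""
    if node is None:
        return best
    # Update the current best with this node's point if it is closer.
    d = np.linalg.norm(np.asarray(target) - node.point)
    if best is None or d < best[1]:
        best = (node.point, d)
    # Recurse into the subtree on the same side of the splitting plane as target.
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest_neighbor(near, target, best)
    # Only cross the plane if a closer point could lie on the other side.
    if abs(diff) < best[1]:
        best = nearest_neighbor(far, target, best)
    return best
```

Calling `nearest_neighbor(build_kd_tree(points), query)` returns the closest stored point and its distance to the query.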