k-NN
Machine Learning
Algorithm
Given a training set $T=\{(x_i,y_i)\mid x_i\in\mathcal{X}\subseteq\mathbf{R}^n,\ y_i\in\mathcal{Y}=\{c_1,c_2,\dots,c_K\}\}$, where $i=1,2,\dots,N$, and a distance metric.
Let $N_k(x)$ denote the set of the $k$ nearest neighbors of $x$ (aka its $k$-NN) under the given distance metric.
Decide the class $y$ of $x$ by majority vote over the points in $N_k(x)$:

$$y=\arg\max_{c_j}\sum_{x_i\in N_k(x)} I(y_i=c_j),$$

where $i=1,2,\dots,N$, $j=1,2,\dots,K$, and $I$ denotes the indicator function.
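A brute-force sketch of this decision rule in Python, assuming Euclidean distance as the metric; the name `knn_predict` and its arguments are illustrative, not from any library:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    """Classify x by majority vote among its k nearest training points."""
    # Euclidean distance from x to every training point x_i.
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest points, i.e. N_k(x).
    nearest = np.argsort(dists)[:k]
    # Count I(y_i = c_j) for each class and take the argmax.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

For example, with `X_train = np.array([[0, 0], [1, 1], [5, 5]])` and `y_train = np.array([0, 0, 1])`, `knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3)` predicts class 0, since two of the three neighbors vote for it.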
Model
Selecting a smaller $k$ reduces the approximation error, but the prediction becomes more sensitive to nearby noise points, i.e. the estimation error increases. At the other extreme, $k=N$ always predicts the overall majority class regardless of the input, which is useless. In practice $k$ is usually taken to be a fairly small value and tuned, e.g. by cross-validation.
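As a sketch only: one way to tune $k$ is cross-validated accuracy of the brute-force `knn_predict` above (the candidate values, fold count, and the name `select_k` are arbitrary choices, not a prescribed recipe):

```python
import numpy as np

def select_k(X, y, candidates=(1, 3, 5, 7, 9), n_folds=5, seed=0):
    """Pick k by cross-validated accuracy of knn_predict (defined above)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    best_k, best_acc = candidates[0], -1.0
    for k in candidates:
        fold_accs = []
        for f in range(n_folds):
            val = folds[f]
            train = np.concatenate([folds[g] for g in range(n_folds) if g != f])
            # Predict every held-out point using only the remaining folds.
            preds = np.array([knn_predict(X[train], y[train], x, k) for x in X[val]])
            fold_accs.append(np.mean(preds == y[val]))
        if np.mean(fold_accs) > best_acc:
            best_k, best_acc = k, float(np.mean(fold_accs))
    return best_k
```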
kd tree
Linear scanning works, but it is too slow when $N$ is large; a kd tree speeds up the search. Given $T=\{x_1,x_2,\dots,x_N\}$, where $x_i=(x_i^{(1)},x_i^{(2)},\dots,x_i^{(k)})^{\mathrm T}$, $i=1,2,\dots,N$, build the tree as follows (a code sketch follows the list):
- For a node at depth $j$, choose the splitting dimension $l=(j \bmod k)+1$ and partition the $x_i$ in the current region into 2 subregions at the median of their $x_i^{(l)}$ values.
- Put those $x_i$ whose $x_i^{(l)}$ is less than the median into the left subtree, the others into the right subtree. Recurse on each subregion until it contains at most one $x_i$.
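A minimal construction sketch under these two steps, assuming 0-based depths so the splitting axis is `depth % k`; `KDNode` and `build_kd_tree` are illustrative names, and the median point itself is stored at the node:

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class KDNode:
    point: np.ndarray                    # the x_i stored at this node (the median)
    axis: int                            # splitting dimension l (0-based)
    left: Optional["KDNode"] = None      # points with x^(l) below the median
    right: Optional["KDNode"] = None     # the remaining points

def build_kd_tree(points, depth=0):
    """Recursively split the points on the median along axis depth % k."""
    if len(points) == 0:
        return None
    k = len(points[0])                   # dimensionality of the x_i
    axis = depth % k                     # cycle through the coordinates
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2                  # index of the median along this axis
    return KDNode(
        point=np.asarray(pts[mid]),
        axis=axis,
        left=build_kd_tree(pts[:mid], depth + 1),
        right=build_kd_tree(pts[mid + 1:], depth + 1),
    )
```

Sorting at every level makes construction $O(N\log^2 N)$; selecting the median in linear time instead would reduce it to $O(N\log N)$.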
kd tree search
TODO: The algorithm is not hard to understand, but I don't know how to implement it. I'll deal with it later.
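For reference until then, a minimal sketch of the standard backtracking search, assuming the `KDNode` tree built above: descend toward the query point, then unwind the recursion, visiting the far subtree only when the splitting hyperplane lies closer than the current best distance. This handles the single nearest neighbor only; returning $k$ neighbors would need a bounded priority queue of the best candidates.

```python
import numpy as np

def nearest_neighbor(node, target, best=None):
    """Return (point, distance) of the nearest stored point to target."""
    if node is None:
        return best
    # Update the current best with this node's point if it is closer.
    d = np.linalg.norm(np.asarray(target) - node.point)
    if best is None or d < best[1]:
        best = (node.point, d)
    # Recurse into the subtree on the same side of the splitting plane as target.
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest_neighbor(near, target, best)
    # Only cross the plane if a closer point could lie on the other side.
    if abs(diff) < best[1]:
        best = nearest_neighbor(far, target, best)
    return best
```

Calling `nearest_neighbor(build_kd_tree(points), query)` returns the closest stored point and its distance to the query.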