@HaomingJiang
2016-08-15T05:20:30.000000Z
字数 4559
阅读 2601
数据挖掘导论
笔记
Extensions:
1. each record can belong to different cluster
2. using statistical distribution to build the model
3. connection restricting clusters
Fuzzy c-means
The difference from k-means: each record has a weighted vector corresponding to different cluster.
It is something like GMM
Use EM to approximate MLE.
The process of EM is similar to K-means, which is a special case of UGMM.
1. randomly initalize the parameters.
2. reassign nodes to each cluster.
3. calculate new parameters.
Pro: more greneral, simplify data, easy to describe the clusters
Con: slow, cant handle linear data, hard to choose model(can be tackled with Bayes Method), noise and outlier
Self Organizing Maps (SOM) is a neural network with a input and a hidden layer.
each node in computational layer is a vector.
1. initialization. Can use vectors from dataset
2. for each input, find a nearest node
3. update: use gradient descend. for each node , nee , topodist is the topology distance in computational layer, which is not .
different topology relation can be apply (e.g. triangle, hexagon)
GOOD VISULIZATION.
Highly rely on desity threshold
Using feature subspace.
It can use a subset of feature to cluster data.
a. find dense units in each 1-D space.
b. repeat:
c. use every (k-1)-D dense unit generate dense k-D unit
d. filter these units.
It seems like Apriori.
Need large computational resource
It is a initialization step.
It is designed for sparse data, e.g. document.
fast and simple; divide data in similar size groups.
It is be taken as first step of Chameleon.
Main property is the relative closeness and relative interconnectivity of the cluster
Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters
The merging scheme preserves self-similarity
Relative Closeness, RC desribes the absolute similarity between 2 clusters regulared with each cluster's self-similarity. It help preserves self-similarity
kNN graph's average weight
Relative Interconnectivity, RI
is the sum of edges
self-similarity can be the combination of RI and RC
Chameleon:
1. construct KNN graph
2. Use a multilevel graph partitioning algorithm on the graph to find a large number of clusters of well-connected vertices (e.g. METIS)
3. repeat: merge the group which preserves the self-similarity the best.
The generated graph called SNN similarity graph, which is sparse.
It is adaptive to the density.
Pro: good at dealing with noise and outlier and clusters with different size,density and shapes and also high-demension
Con: not all record will be clustered. it's is too brittle
similarly, we can use cluastering based on SNN density.
Spatial tactics: k-d tree, use triangle inequality to simplify the distance calculation (i.e. by using distance from the point to the current centre, and distances between centers, we can estimate the distance from the point to other centres)
Sampling method, only cluster a sample.
Devide the graph into different eperated groups.
Parallel Computing.