@twoer2
2020-01-26T18:00:30.000000Z
字数 28752
阅读 633
course
Facts:
In practice, the curse of dimensionality means that, for a given sample size, there is a maximum number of features above which the performance of our classifier will degrade rather than improve:
In most cases, the additional information that is lost by discarding some features is (more than) compensated by a more accurate mapping in the lower-dimensional space.
To conquer the curse of dimensionality, we often need dimension reduction. In fact, data in high dimensional space is not uniformly distributed. Often data is distributed in a lower dimension but presented in the high dimension. Recovering such a low-dimensional structure is a key to modelling high dimensional data in its very nature.
There are two methodologies of dimensionality reduction:
However, reducing dimension leads to the problem of losing information.
The goal of PCA (Principal Component Analysis) is to project the high-dimensional data onto a group of orthogonal unit vectors, where the variance of the projections is maximized. These vectors are called principal components.
Given a dataset of data points in a -dimensional space, .
where . Since is centralized, . Let ,
where is the empirical covariance matrix. (or for unbiased estimation)
where is the Lagrange multiplier.
so we should choose the eigenvector with the largest eigenvalue as to maximize the variance.
Let . If is the 2nd principal component, it should follow the constraints that and are orthogonal, i.e. .
where and are Lagrange multipliers.
By Multiplying on the left, we get
Note that
and
Therefore,
so we could choose the eigenvector with the second largest eigenvalue as to maximize the variance. The choice of is the 2nd principal component.
Summary:
For a () matrix, , is a matrix. It is often computationally infeasible in solve its eigenvalue problem. We can construct a matrix , which has
Then, we apply SVD (Singular Value Decomposition) on
Then transform into
It is obvious to see that
Since is an diagonal matrix, the column vectors of are eigenvectors of . Select first () columns of (which is the first principal components) to form a project matrix .
Summary:
In practice, we can use Proportion of Variance (PoV) to determine an appropriate dimensionality in the PCA space:
When PoV >= 90%, the corresponding will be assigned to be .
Linear Discriminant Analysis (LDA) is a method for high-dimensional analysis in the supervised learning paradigm as class labels are available in a dataset. The objective of LDA is to perform dimensionality reduction while maximizing the difference between the projected means of different classes and minimizing the variance of projections of each class.
Assume we have a set of -dimensional sample . of which belong to class , and to class . We seek to obtain a scalar by projecting the samples onto a line
The mean vector of each class in and feature space is
The difference between the projected means is
The matrix is called the between-class scatter. Similarly, the variance of each class (called the scatter) is
where
The matrix is called the within-class scatter.
The criterion function (Fisher criterion) is defined as
To find the maximum of , we derive and equate to zero
is an eigenvector of .
Note that and are scalars, so and are in the same direction. As well, the scaling of does not affect, so we can set
Summary:
where
Denote the number of classes as . LDA can only produce at most feature projections. The projection matrix
The generalization of the within-class scatter is
where and
The generalization for the between-class scatter is
where
The total scatter matrix is defined as
Since , we can get
Similarly, we can get
Therefore,
(However, is not important is the algorithm of multiple class LDA. You can omit it)
Similarly, we define the mean vector and scatter matrices for the projected samples as
Since the projection is no long a scalar (it has dimensions), we then use the determinant of the scatter matrices to obtain a scalar objective function:
It can be shown that the optimal projection matrix is the one whose columns are the eigenvectors corresponding to the largest eigenvalues of the following generalized eigenvalue problem
Note that in , each is the product of two vectors, so is the sum of matrices of rank one or less, and therefore .
Ref: Rank Wiki. , .
At the mean time, , are linearly dependent, at least one could be represented by the others.
Note that , where , . Therefore,
where is the rank of . Since it has at most linearly independent columns, its rank is at most . Therefore, . This means that at most of the eigenvalues will be non-zero, and therefore LDA could reduce the samples to at most dimensions.
Summary:
where
where
Compute the optimal projection matrix , where is the k-th eigenvector of .
Apply the projection to samples
Self-Organizing Map (SOM) is a biologically inspired unsupervised neural network that approximates an unlimited number of input data by a finite set of nodes arranged in a low-dimensional grid (one or two-dimensional), where neighbor nodes correspond to more similar input data.
Each neuron in a SOM (an one or two-dimensional grid) is assigned a weight vector with the same dimensionality as the input space (or, weight space). What we do is like updating these weights to make this grid "cover" all samples. The updating strategy is as following:
Summary:
Repeat until convergence
where is the learning rate related to the number of iteration : . The is the neighborhood kernel function, where is the Manhattan distance between neurons and and is the neighborhood size. are hyper-parameters.
For weight space of dimension less than 4, neurons could be visualized by their corresponding weight vectors. For high-dimensional data, a unified distance matrix (U-matrix) is constructed to facilitate the visualization. The shape of one kind of U-matrix is the same as the SOM network. The corresponding value in the U-matrix of the neuron is as the following:
where is the set of neighbors of . Distance between the weights of neighboring neurons gives an approximation of the distance between different parts of the underlying data. Dipict U-matrix in an image (heat map), similar colors depict the closely spaced nodes and distinct colors indicate the more distant nodes. Groups of similar colors can be considered as clusters, and the contrast parts as the boundary regions.
Multi-dimensional scaling (MDS) maps the distance between observations in a high dimensional space into a lower dimensional space. The goal is to find a configuration of points in the low dimensional space whose inter-point distance preserves dissimilarities of corresponding points in the higher dimension.
Given points in dimensions, the distance between points and is . We want to find points in 2 (or 3) dimensions s.t. the distance between and is close to . We can look for by minimizing an objective function. The following are three possible objective functions (Stress):
Note that the s are constant. penalizes large absolute errors. penalizes large relative errors. is a compromise between the two before.
For each low dimensional point , update it by:
Summary:
Given the distance matrix in the high dimensional space, denote the low dimensional embeddings as and the low dimensional distance . It is expected that . Therefore,
where the matrix is the inner product matrix of embeddings. To simplify, we can centralize the embeddings , i.e. . Obviously, . Therefore,
where is the trace of , i.e. . So, we can get
If we let , we can derive that , where and .
After we derive the matrix , what we need next is to apply eigenvalue decomposition on : . Suppose there are non-zero eigenvalues in , they form the diagnoal matrix , and let denotes the corresponding eigenvectors, then can be represented as:
If we do not need strictly and expect to reduce dimensionality effectively, we can select a to construct .
Summary:
Isometric feature mapping (ISOMAP) is a non-linear MDS method for modelling manifold structure appearing in a high dimensional space. The ISOMAP preserves the geodesic distances between points, while the (normal) MDS preserves the Euclidean distances.
(The left figure depicts the Euclidean distance of two points, while the right figure depicts the geodesic distance between them (the red line))
For neighboring samples, the Euclidean distance provides a good approximation to geodesic distance. For distant points, the geodesic distance can be approximated with a sequence of steps between clusters of neighboring points. Therefore, the first step of the algorithm is to determine which points are neighbors in the manifold, based on the distance in the input space . The set of neighbors can be determined in two different ways:
The first method may cause problems if the samples are sparse. The second one might be affected by outliers. These neighborhood relations are represented as a weighted graph . Then, we can estimate the geodesic distances between all pair of points on the maniford by computing their shortest path distances in the graph . This can be performed, e.g. using Dijkstra's Algorithm, Floyd–Warshall algorithm. The distance matrix is then inputted to the classical MDS algorithm (analytical solution).
Summary:
Locally Linear Embedding (LLE) is another non-linear method for modelling manifold structure. It models the local geometry of samples in the high dimensional space, and expects that after projecting to the low dimensional space they can keep the local geometry. The local geometry is modeled by linear weights that reconstruct each data point as a linear combination of its neighbors.
For each points
where the weight measures the contribution of the j-th sample to the reconstruction of the i-th sample. If is not the neighbor of , . These weights are computed by minimizing the reconstruction error
The reconstruction error is equivalent to
where , . using a Lagrange multiplier to enforce the constraint that :
Compute the derivative w.r.t :
Since , we can set and then
(Be careful that is a column vector. However, in the weight matrix , those weights related to are at the i-th row, because it is that denotes the weight of to , but not ).
The embedding vectors preserves the local geometry:
The cost is equivalent to
where
and is 1 if and 0 otherwise. It can be shown that
Since coordinates can be translated without affecting the cost function, we remove this degree of freedom by imposing that they are centered, i.e. . To avoid degenerate solutions, we constraint the embedding vectors to have unit covariance matrix, i.e. . According to the Rayleitz-Ritz theorem, the optimal embedding is found by computing the bottom eigenvectors. The bottom eigenvector is discarded and the remained eigenvectors form the embedding coordinates.
Summary: