[关闭]
@twoer2 2020-01-26T18:00:30.000000Z 字数 28752 阅读 633

Modelling and Visualization of High Dimensional Data

course

Curse of Dimensionality and Dimension Reduction

Facts:

In practice, the curse of dimensionality means that, for a given sample size, there is a maximum number of features above which the performance of our classifier will degrade rather than improve:

curse of dimensionality

In most cases, the additional information that is lost by discarding some features is (more than) compensated by a more accurate mapping in the lower-dimensional space.

To conquer the curse of dimensionality, we often need dimension reduction. In fact, data in high dimensional space is not uniformly distributed. Often data is distributed in a lower dimension but presented in the high dimension. Recovering such a low-dimensional structure is a key to modelling high dimensional data in its very nature.

There are two methodologies of dimensionality reduction:

However, reducing dimension leads to the problem of losing information.

PCA

Basic Algorithm

The goal of PCA (Principal Component Analysis) is to project the high-dimensional data onto a group of orthogonal unit vectors, where the variance of the projections is maximized. These vectors are called principal components.

Given a dataset of data points in a -dimensional space, .

where . Since is centralized, . Let ,

where is the empirical covariance matrix. (or for unbiased estimation)

where is the Lagrange multiplier.

so we should choose the eigenvector with the largest eigenvalue as to maximize the variance.

Let . If is the 2nd principal component, it should follow the constraints that and are orthogonal, i.e. .

where and are Lagrange multipliers.

By Multiplying on the left, we get

Note that

and

Therefore,

so we could choose the eigenvector with the second largest eigenvalue as to maximize the variance. The choice of is the 2nd principal component.

Summary:

Dual Algorithm

For a () matrix, , is a matrix. It is often computationally infeasible in solve its eigenvalue problem. We can construct a matrix , which has

Then, we apply SVD (Singular Value Decomposition) on

Then transform into

It is obvious to see that

Since is an diagonal matrix, the column vectors of are eigenvectors of . Select first () columns of (which is the first principal components) to form a project matrix .

Summary:

Proportion of Variance

In practice, we can use Proportion of Variance (PoV) to determine an appropriate dimensionality in the PCA space:

When PoV >= 90%, the corresponding will be assigned to be .

Limitations of the standard PCA

LDA

Linear Discriminant Analysis (LDA) is a method for high-dimensional analysis in the supervised learning paradigm as class labels are available in a dataset. The objective of LDA is to perform dimensionality reduction while maximizing the difference between the projected means of different classes and minimizing the variance of projections of each class.

Two Classes

Assume we have a set of -dimensional sample . of which belong to class , and to class . We seek to obtain a scalar by projecting the samples onto a line

The mean vector of each class in and feature space is

The difference between the projected means is

The matrix is called the between-class scatter. Similarly, the variance of each class (called the scatter) is

where

The matrix is called the within-class scatter.

The criterion function (Fisher criterion) is defined as

To find the maximum of , we derive and equate to zero

is an eigenvector of .

Note that and are scalars, so and are in the same direction. As well, the scaling of does not affect, so we can set

Summary:

where

Multiple Class

Denote the number of classes as . LDA can only produce at most feature projections. The projection matrix

The generalization of the within-class scatter is

where and

The generalization for the between-class scatter is

where

The total scatter matrix is defined as

Since , we can get

Similarly, we can get

Therefore,

(However, is not important is the algorithm of multiple class LDA. You can omit it)

Similarly, we define the mean vector and scatter matrices for the projected samples as

Since the projection is no long a scalar (it has dimensions), we then use the determinant of the scatter matrices to obtain a scalar objective function:

It can be shown that the optimal projection matrix is the one whose columns are the eigenvectors corresponding to the largest eigenvalues of the following generalized eigenvalue problem

Note that in , each is the product of two vectors, so is the sum of matrices of rank one or less, and therefore .

Ref: Rank Wiki. , .

At the mean time, , are linearly dependent, at least one could be represented by the others.

Note that , where , . Therefore,

where is the rank of . Since it has at most linearly independent columns, its rank is at most . Therefore, . This means that at most of the eigenvalues will be non-zero, and therefore LDA could reduce the samples to at most dimensions.

Summary:

where

where

Limitations of LDA

SOM

Self-Organizing Map (SOM) is a biologically inspired unsupervised neural network that approximates an unlimited number of input data by a finite set of nodes arranged in a low-dimensional grid (one or two-dimensional), where neighbor nodes correspond to more similar input data.

Each neuron in a SOM (an one or two-dimensional grid) is assigned a weight vector with the same dimensionality as the input space (or, weight space). What we do is like updating these weights to make this grid "cover" all samples. The updating strategy is as following:

updating weights

Summary:

U-Matrix

For weight space of dimension less than 4, neurons could be visualized by their corresponding weight vectors. For high-dimensional data, a unified distance matrix (U-matrix) is constructed to facilitate the visualization. The shape of one kind of U-matrix is the same as the SOM network. The corresponding value in the U-matrix of the neuron is as the following:

where is the set of neighbors of . Distance between the weights of neighboring neurons gives an approximation of the distance between different parts of the underlying data. Dipict U-matrix in an image (heat map), similar colors depict the closely spaced nodes and distinct colors indicate the more distant nodes. Groups of similar colors can be considered as clusters, and the contrast parts as the boundary regions.

U-matrix

MDS

Multi-dimensional scaling (MDS) maps the distance between observations in a high dimensional space into a lower dimensional space. The goal is to find a configuration of points in the low dimensional space whose inter-point distance preserves dissimilarities of corresponding points in the higher dimension.

Iterative

Given points in dimensions, the distance between points and is . We want to find points in 2 (or 3) dimensions s.t. the distance between and is close to . We can look for by minimizing an objective function. The following are three possible objective functions (Stress):

Note that the s are constant. penalizes large absolute errors. penalizes large relative errors. is a compromise between the two before.

For each low dimensional point , update it by:

Summary:

Analytic

Given the distance matrix in the high dimensional space, denote the low dimensional embeddings as and the low dimensional distance . It is expected that . Therefore,

where the matrix is the inner product matrix of embeddings. To simplify, we can centralize the embeddings , i.e. . Obviously, . Therefore,

where is the trace of , i.e. . So, we can get

If we let , we can derive that , where and .

After we derive the matrix , what we need next is to apply eigenvalue decomposition on : . Suppose there are non-zero eigenvalues in , they form the diagnoal matrix , and let denotes the corresponding eigenvectors, then can be represented as:

If we do not need strictly and expect to reduce dimensionality effectively, we can select a to construct .

Summary:

MDS Family

ISOMAP

Isometric feature mapping (ISOMAP) is a non-linear MDS method for modelling manifold structure appearing in a high dimensional space. The ISOMAP preserves the geodesic distances between points, while the (normal) MDS preserves the Euclidean distances.

geodesic distance

(The left figure depicts the Euclidean distance of two points, while the right figure depicts the geodesic distance between them (the red line))

For neighboring samples, the Euclidean distance provides a good approximation to geodesic distance. For distant points, the geodesic distance can be approximated with a sequence of steps between clusters of neighboring points. Therefore, the first step of the algorithm is to determine which points are neighbors in the manifold, based on the distance in the input space . The set of neighbors can be determined in two different ways:

The first method may cause problems if the samples are sparse. The second one might be affected by outliers. These neighborhood relations are represented as a weighted graph . Then, we can estimate the geodesic distances between all pair of points on the maniford by computing their shortest path distances in the graph . This can be performed, e.g. using Dijkstra's Algorithm, Floyd–Warshall algorithm. The distance matrix is then inputted to the classical MDS algorithm (analytical solution).

Summary:

Limitations of ISOMAP

LLE

Locally Linear Embedding (LLE) is another non-linear method for modelling manifold structure. It models the local geometry of samples in the high dimensional space, and expects that after projecting to the low dimensional space they can keep the local geometry. The local geometry is modeled by linear weights that reconstruct each data point as a linear combination of its neighbors.

LLE

For each points

where the weight measures the contribution of the j-th sample to the reconstruction of the i-th sample. If is not the neighbor of , . These weights are computed by minimizing the reconstruction error

The reconstruction error is equivalent to

where , . using a Lagrange multiplier to enforce the constraint that :

Compute the derivative w.r.t :

Since , we can set and then

(Be careful that is a column vector. However, in the weight matrix , those weights related to are at the i-th row, because it is that denotes the weight of to , but not ).

The embedding vectors preserves the local geometry:

The cost is equivalent to

where

and is 1 if and 0 otherwise. It can be shown that

Since coordinates can be translated without affecting the cost function, we remove this degree of freedom by imposing that they are centered, i.e. . To avoid degenerate solutions, we constraint the embedding vectors to have unit covariance matrix, i.e. . According to the Rayleitz-Ritz theorem, the optimal embedding is found by computing the bottom eigenvectors. The bottom eigenvector is discarded and the remained eigenvectors form the embedding coordinates.

Summary:

Limitations of LLE

Reference

SVD 1
SVD 2
PCA
LLE

添加新批注
在作者公开此批注前,只有你和作者可见。
回复批注