linear regression
ml
linear regression model
for the $n$-th data sample the estimate is

$$
f(\mathbf{x}_n)_{\text{estimate}} := \begin{bmatrix} 1 & x_{n1} & x_{n2} & \cdots & x_{nD} \end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_D \end{bmatrix}
= \tilde{\mathbf{x}}_n^\top \boldsymbol{\beta}
$$

stacking all $N$ samples:

$$
\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}
\approx \begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1D} \\
1 & x_{21} & x_{22} & \cdots & x_{2D} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{N1} & x_{N2} & \cdots & x_{ND}
\end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_D \end{bmatrix}
= \tilde{X} \boldsymbol{\beta}
$$
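A quick numpy sketch of this model (helper names `design_matrix` and `predict` are my own, not from the notes): prepend a column of ones to the raw feature matrix to form $\tilde{X}$ and evaluate $\hat{\mathbf{y}} = \tilde{X}\boldsymbol{\beta}$.

```python
import numpy as np

# Build the design matrix X_tilde by prepending a column of ones to the
# raw N x D feature matrix, then form the prediction y_hat = X_tilde @ beta.
def design_matrix(X):
    N = X.shape[0]
    return np.hstack([np.ones((N, 1)), X])   # shape (N, D+1)

def predict(X, beta):
    return design_matrix(X) @ beta           # y_hat, shape (N,)

# toy usage: N=3 samples, D=2 features (made-up numbers)
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
beta = np.array([0.5, 1.0, -1.0])            # [beta_0, beta_1, beta_2]
print(predict(X, beta))
```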
- consider the residual error:
$$
\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}
= \begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1D} \\
1 & x_{21} & x_{22} & \cdots & x_{2D} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{N1} & x_{N2} & \cdots & x_{ND}
\end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_D \end{bmatrix}
+ \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{bmatrix}
= \tilde{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}
$$
- since we do not know the exact error, assume it follows a Gaussian distribution: $\epsilon \sim \mathcal{N}(\mu, \sigma^2)$ (typically with $\mu = 0$)
- rewrite the linear regression model probabilistically. For the $i$-th data sample:

$$
p(y_i \mid \mathbf{x}_i, \boldsymbol{\theta}) = p(y_i \mid \mathbf{x}_i, (\mathbf{w}, \sigma_i^2))
= \mathcal{N}(\mu(\mathbf{x}_i), \sigma^2(\mathbf{x}_i))
= \mathcal{N}(\mathbf{w}^\top \mathbf{x}_i, \sigma^2(\mathbf{x}_i))
$$
in our notation:

$$
p(\mathbf{y} \mid X, \boldsymbol{\theta}) = p(\mathbf{y} \mid X, (\boldsymbol{\beta}, \sigma^2))
= \mathcal{N}(\tilde{X}\boldsymbol{\beta}, \sigma^2)
$$

written out:

$$
p\left(
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}
\;\middle|\;
\begin{bmatrix}
1 & x_{11} & \cdots & x_{1D} \\
1 & x_{21} & \cdots & x_{2D} \\
\vdots & \vdots & & \vdots \\
1 & x_{N1} & \cdots & x_{ND}
\end{bmatrix},
\left(
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_D \end{bmatrix},
\sigma^2
\right)
\right)
= \mathcal{N}\!\left(\tilde{X}\boldsymbol{\beta},\ \sigma^2\right)
$$
- for a basis function expansion $\phi(\mathbf{x})$:

$$
p(y \mid \mathbf{x}, \boldsymbol{\theta}) = p(y \mid \mathbf{x}, (\mathbf{w}, \sigma^2, \phi))
= \mathcal{N}(\mathbf{w}^\top \phi(\mathbf{x}), \sigma^2(\mathbf{x}))
$$

- e.g. the polynomial basis $\phi(x) = [1, x, x^2, \cdots, x^d]$
- applied to the design matrix: $\phi(\tilde{\mathbf{x}}_n) \overset{?}{=} [1, \tilde{\mathbf{x}}_n, \tilde{\mathbf{x}}_n^2, \cdots, \tilde{\mathbf{x}}_n^d]$ (element-wise powers, row by row; sketch below)
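A small sketch of the 1-D polynomial basis expansion (the helper name `poly_design` is illustrative): it maps a column of scalar inputs to the $N \times (d+1)$ expanded design matrix.

```python
import numpy as np

# Polynomial basis expansion phi(x) = [1, x, x^2, ..., x^d] applied row-wise:
# a column of scalar inputs becomes an N x (d+1) design matrix.
def poly_design(x, d):
    x = np.asarray(x).reshape(-1, 1)    # column vector, shape (N, 1)
    powers = np.arange(d + 1)           # exponents [0, 1, ..., d]
    return x ** powers                  # broadcasting -> shape (N, d+1)

x = np.array([0.0, 1.0, 2.0])
print(poly_design(x, d=3))
# [[1. 0. 0. 0.]
#  [1. 1. 1. 1.]
#  [1. 2. 4. 8.]]
```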
cost function
is there an automatic way to define the loss function? (under the Gaussian noise model above, maximizing the likelihood corresponds to minimizing the squared error)
Two desirable properties of cost functions:
1. Positive and negative errors should be penalized equally.
2. "Large" mistakes and "very large" mistakes should be penalized almost equally.

Statistical vs. computational (convexity) tradeoff:
cost function:

$$
\mathcal{L}(\boldsymbol{\beta}) = \mathcal{L}\!\left(\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_D \end{bmatrix}\right)
= \mathcal{L}(\beta_0, \beta_1, \cdots, \beta_D)
$$
- MAE: $\mathrm{MAE} := \frac{1}{N}\sum_{n=1}^{N} \lvert y_n - f(\mathbf{x}_n)_{\text{estimate}} \rvert = \frac{1}{N}\sum_{n=1}^{N} \lvert \epsilon_n \rvert = \frac{1}{N}\left(\lvert\epsilon_1\rvert + \lvert\epsilon_2\rvert + \cdots + \lvert\epsilon_N\rvert\right)$
- MSE: $\mathrm{MSE} := \frac{1}{N}\sum_{n=1}^{N} \left(y_n - f(\mathbf{x}_n)_{\text{estimate}}\right)^2$
- Huber loss: quadratic for small residuals, linear for large ones (see the sketch after this list)
- Tukey's bisquare loss: defined in terms of gradients
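A hedged numpy sketch of the losses above (MAE, MSE, and Huber with an assumed threshold `delta`; Tukey's bisquare is omitted):

```python
import numpy as np

# Per-residual losses, averaged over the N residuals eps_n = y_n - f(x_n).
def mae(residuals):
    return np.mean(np.abs(residuals))

def mse(residuals):
    return np.mean(residuals ** 2)

def huber(residuals, delta=1.0):
    r = np.abs(residuals)
    quad = 0.5 * r ** 2                      # quadratic near zero
    lin = delta * (r - 0.5 * delta)          # linear in the tails -> robust to large errors
    return np.mean(np.where(r <= delta, quad, lin))

eps = np.array([0.1, -0.5, 3.0])             # made-up residuals
print(mae(eps), mse(eps), huber(eps))
```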
goal: find $\boldsymbol{\beta}^*$ such that $\mathcal{L}(\boldsymbol{\beta})$ reaches its minimum; note that $\boldsymbol{\beta} \in \mathbb{R}^{D+1}$
- an unconstrained optimization problem
- existence of an optimal solution?
- characterization of the optimal solution
- algorithm for computing the optimal solution
- methods: grid search; gradient descent; least squares (a closed-form least-squares sketch follows below)
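For the least-squares method, a minimal sketch of the closed-form solution via the normal equations (assuming the MSE cost and an invertible $\tilde{X}^\top\tilde{X}$; in practice `np.linalg.lstsq` is preferred for stability):

```python
import numpy as np

# Closed-form least squares: beta* solves (X_tilde^T X_tilde) beta = X_tilde^T y.
def normal_equations(X_tilde, y):
    A = X_tilde.T @ X_tilde
    b = X_tilde.T @ y
    return np.linalg.solve(A, b)

# toy data lying exactly on y = 1 + 2x
X_tilde = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
print(normal_equations(X_tilde, y))   # ~ [1., 2.]  (intercept, slope)
```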
GRID SEARCH
- extremely simple
- works for any kind of loss
- but the computational complexity is exponential in the number of parameters (sketch below)
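An illustrative grid-search sketch (grid range and resolution are arbitrary choices): it evaluates the loss at every point of a $(D+1)$-dimensional grid, which is why the cost blows up exponentially.

```python
import numpy as np
from itertools import product

# Evaluate the loss at every grid point for every coefficient and keep the best.
def grid_search(X_tilde, y, loss, grid):
    best_beta, best_val = None, np.inf
    for beta in product(grid, repeat=X_tilde.shape[1]):   # all grid combinations
        beta = np.array(beta)
        val = loss(y - X_tilde @ beta)
        if val < best_val:
            best_beta, best_val = beta, val
    return best_beta, best_val

X_tilde = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
mse = lambda r: np.mean(r ** 2)
print(grid_search(X_tilde, y, mse, grid=np.linspace(-3, 3, 25)))
```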
BATCH GRADIENT DESCENT
$$
\boldsymbol{\beta}^{(k+1)} = \boldsymbol{\beta}^{(k)} - \alpha \nabla\mathcal{L}(\boldsymbol{\beta}^{(k)})
= \begin{bmatrix} \beta_0^{(k)} \\ \beta_1^{(k)} \\ \vdots \\ \beta_D^{(k)} \end{bmatrix}
- \alpha
\begin{bmatrix}
\partial\mathcal{L}(\beta_0^{(k)}, \beta_1^{(k)}, \cdots, \beta_D^{(k)}) / \partial\beta_0 \\
\partial\mathcal{L}(\beta_0^{(k)}, \beta_1^{(k)}, \cdots, \beta_D^{(k)}) / \partial\beta_1 \\
\vdots \\
\partial\mathcal{L}(\beta_0^{(k)}, \beta_1^{(k)}, \cdots, \beta_D^{(k)}) / \partial\beta_D
\end{bmatrix}
$$
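A sketch of batch gradient descent for the MSE cost, whose gradient is $\nabla\mathcal{L}(\boldsymbol{\beta}) = -\frac{2}{N}\tilde{X}^\top(\mathbf{y} - \tilde{X}\boldsymbol{\beta})$; the step size and stopping tolerance below are illustrative choices.

```python
import numpy as np

# Batch gradient descent for L(beta) = (1/N) ||y - X_tilde beta||^2.
def batch_gradient_descent(X_tilde, y, alpha=0.1, tol=1e-6, max_iter=10_000):
    N, Dp1 = X_tilde.shape
    beta = np.zeros(Dp1)
    for _ in range(max_iter):
        grad = -(2.0 / N) * X_tilde.T @ (y - X_tilde @ beta)
        if np.linalg.norm(grad) < tol:        # stopping criterion: gradient ~ 0
            break
        beta = beta - alpha * grad            # beta^(k+1) = beta^(k) - alpha * grad L(beta^(k))
    return beta

X_tilde = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
print(batch_gradient_descent(X_tilde, y))     # converges to ~ [1., 2.]
```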
convergence of the method - analysis of gradient descent
- when is the method guaranteed to converge?
- when $\mathcal{L}(\boldsymbol{\beta})$ is continuously differentiable on the bounded level set $\{\boldsymbol{\beta} \mid \mathcal{L}(\boldsymbol{\beta}) < \mathcal{L}(\boldsymbol{\beta}^{(0)})\}$ -- refer to the [convergence theorem][1]
- the sequence $\{\nabla\mathcal{L}(\boldsymbol{\beta}^{(k)})\}$: convergence order, linear convergence, convergence constant -- [see here][2]
- if the problem is a quadratic optimization problem (QP), the model is: minimize $f(x) := \frac{1}{2}x^\top Q x + c^\top x$, where $Q$ is a symmetric matrix and the level sets are ellipses; in this case the eigenvalues give the radii of the ellipse, and **the convergence constant depends very much on the ratio of the largest to the smallest eigenvalue of the Hessian matrix $H(x)$ at the optimal solution $x^*$** -- **we want the ratio to be small, that is, nearly 1**
- start from the simplest quadratic-form problem: for $Q = \begin{bmatrix} A & B \\ C & D \end{bmatrix}$, the Hessian of $x^\top Q x$ is $\begin{bmatrix} 2A & B+C \\ C+B & 2D \end{bmatrix}$. During gradient descent, $(x, y)$ moves in the direction opposite to $(f_x, f_y)$, scaled by $\alpha$. If the rates of change of the gradient in $x$ and $y$ ($f_{xx}$, $f_{yy}$) are similar, both coordinates approach the minimum at roughly the same time, so the convergence rate is fast. On a quadratic model this means moving along the radius (axis) directions; if the level sets are nearly circular, convergence is fastest. The radii come from the eigenvalues of $H$, which is just $Q$ after a linear transform, with the radii unchanged.
- the second derivative is the Hessian matrix, a beautiful symmetric matrix with many nice properties: its level sets (cross-sections) are ellipses, it is positive definite, eigenvalue = radius of the ellipse, and so on
- other cases are messier to analyze, so the intuition is enough: gradient descent moves along the radius directions associated with the eigenvalues, and the closer the largest and smallest $\lambda$ are, the better. (For a function of two variables there are only two $\lambda$'s, so it is simply their ratio; with more than two variables, view it as several two-variable problems, e.g. $(x,z)$, $(x,y)$, $(y,z)$, and the limiting factor is the pair with the largest max-to-min ratio.)
ratio of largest to smallest eigenvalue = condition number of the matrix
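A toy illustration of this eigenvalue-ratio intuition (all numbers are made up): gradient descent on $f(x) = \frac{1}{2}x^\top Q x$ needs far more iterations when $Q$ is ill-conditioned.

```python
import numpy as np

# Count gradient-descent iterations on f(x) = 1/2 x^T Q x until the gradient is tiny.
def gd_iterations(Q, alpha, tol=1e-8, max_iter=100_000):
    x = np.array([1.0, 1.0])
    for k in range(max_iter):
        grad = Q @ x                          # gradient of 1/2 x^T Q x
        if np.linalg.norm(grad) < tol:
            return k
        x = x - alpha * grad
    return max_iter

Q_good = np.diag([1.0, 2.0])                  # condition number 2
Q_bad = np.diag([1.0, 100.0])                 # condition number 100
for Q in (Q_good, Q_bad):
    lam_max = np.max(np.linalg.eigvalsh(Q))
    print(np.linalg.cond(Q), gd_iterations(Q, alpha=1.0 / lam_max))
```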
Stopping criteria -- optimality conditions -- design of the gradient descent method:
- first derivative of $\mathcal{L}(\boldsymbol{\beta})$ is zero: $\frac{\partial\mathcal{L}(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}} = 0$, i.e.

$$
\nabla\mathcal{L}(\boldsymbol{\beta}) =
\begin{bmatrix}
\partial\mathcal{L}(\boldsymbol{\beta}) / \partial\beta_0 \\
\partial\mathcal{L}(\boldsymbol{\beta}) / \partial\beta_1 \\
\vdots \\
\partial\mathcal{L}(\boldsymbol{\beta}) / \partial\beta_D
\end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}
$$
- second derivative > 0, i.e. the Hessian is positive definite
Optimality conditions are still useful, in that they serve as a stopping criterion when they are satisfied to within a predetermined error tolerance.
tradeoff: faster convergence ⇔ higher computational cost per iteration
- step-size selection: α
- requirement: convergence to a local minimum is guaranteed only when $\alpha < \alpha_{\min}$, where $\alpha_{\min}$ is a fixed constant that depends on the problem
- line-search methods: used to set the step size automatically, e.g. backtracking
- MIT nonlinear programming -- line-search notes; summary of the notes:
- "bisenction algorithm for a line-search of a convex function" seek to solve:
- α¯¯:=argminαf(x¯+αd¯)
- x_bar : current iterate, d_bar: curretn direction generate by an algorithm that seeks to minimize f(x) such as a descent diection of f(x) at x = x_bar
- => let h(α)=f(x¯+αd¯)
- goal: find α0 such that h(α) reach minimum
- first derivative: h′(α)=0
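A sketch of this bisection line search under the stated convexity assumption (the bracket-growing step assumes $h'$ eventually becomes non-negative):

```python
import numpy as np

# Bisection line search: for convex f and descent direction d_bar,
# h(alpha) = f(x_bar + alpha * d_bar) is convex in alpha, so we bisect on the
# sign of h'(alpha) = grad_f(x_bar + alpha * d_bar)^T d_bar until h'(alpha) ~ 0.
def bisection_line_search(grad_f, x_bar, d_bar, alpha_hi=1.0, tol=1e-8):
    h_prime = lambda a: grad_f(x_bar + a * d_bar) @ d_bar
    while h_prime(alpha_hi) < 0:              # grow the bracket until h' >= 0
        alpha_hi *= 2.0
    alpha_lo = 0.0
    while alpha_hi - alpha_lo > tol:
        mid = 0.5 * (alpha_lo + alpha_hi)
        if h_prime(mid) < 0:
            alpha_lo = mid                    # minimizer lies to the right
        else:
            alpha_hi = mid                    # minimizer lies to the left
    return 0.5 * (alpha_lo + alpha_hi)

# toy usage on f(x) = 1/2 ||x||^2 with the steepest-descent direction
grad_f = lambda x: x
x_bar = np.array([2.0, -1.0])
d_bar = -grad_f(x_bar)
print(bisection_line_search(grad_f, x_bar, d_bar))   # ~ 1.0 (exact minimizer here)
```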
convexity
Outliers - Robust statistics:
- outliers may bias the previous summary statistics; this can be addressed by eliminating or downweighting the outlier values in the sample (quality control), or by using statistics that are resistant to the presence of outliers
- resistant != robust: "robust" is used in statistics to refer to insensitivity to the choice of probability model or estimator rather than to the data values
least squares
- least squares estimates for regression models are highly sensitive to (not robust against) outliers.
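A small made-up demonstration of that sensitivity: fitting the same line by least squares with and without a single corrupted observation shifts the estimate noticeably.

```python
import numpy as np

# Fit y ~ beta_0 + beta_1 x by least squares on clean data, then again after
# corrupting one y value, and compare the estimates.
x = np.arange(10, dtype=float)
y = 1.0 + 2.0 * x                              # exact line y = 1 + 2x
X_tilde = np.column_stack([np.ones_like(x), x])

beta_clean, *_ = np.linalg.lstsq(X_tilde, y, rcond=None)

y_outlier = y.copy()
y_outlier[-1] += 50.0                          # one gross outlier
beta_outlier, *_ = np.linalg.lstsq(X_tilde, y_outlier, rcond=None)

print(beta_clean)     # ~ [1., 2.]
print(beta_outlier)   # noticeably shifted by the single outlier
```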