linear regression
ml
linear regression model
for the $n$-th data sample the estimate is

$$
f(\mathbf{x}_n)_{\text{estimate}} := \begin{bmatrix} 1 & x_{n1} & x_{n2} & \cdots & x_{nD} \end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_D \end{bmatrix}
= \tilde{\mathbf{x}}_n^\top \boldsymbol{\beta}
$$

stacking all $N$ samples:

$$
\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}
\approx \begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1D} \\
1 & x_{21} & x_{22} & \cdots & x_{2D} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{N1} & x_{N2} & \cdots & x_{ND}
\end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_D \end{bmatrix}
= \tilde{X} \boldsymbol{\beta}
$$
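A quick numpy sketch of this model (helper names `design_matrix` and `predict` are my own, not from the notes): prepend a column of ones to the raw feature matrix to form $\tilde{X}$ and evaluate $\hat{\mathbf{y}} = \tilde{X}\boldsymbol{\beta}$.

```python
import numpy as np

# Build the design matrix X_tilde by prepending a column of ones to the
# raw N x D feature matrix, then form the prediction y_hat = X_tilde @ beta.
def design_matrix(X):
    N = X.shape[0]
    return np.hstack([np.ones((N, 1)), X])   # shape (N, D+1)

def predict(X, beta):
    return design_matrix(X) @ beta           # y_hat, shape (N,)

# toy usage: N=3 samples, D=2 features (made-up numbers)
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
beta = np.array([0.5, 1.0, -1.0])            # [beta_0, beta_1, beta_2]
print(predict(X, beta))
```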
- consider the residual error:
$$
\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}
= \begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1D} \\
1 & x_{21} & x_{22} & \cdots & x_{2D} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{N1} & x_{N2} & \cdots & x_{ND}
\end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_D \end{bmatrix}
+ \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_N \end{bmatrix}
= \tilde{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}
$$
- since we do not know the exact error, assume it follows a Gaussian distribution: $\epsilon \sim \mathcal{N}(\mu, \sigma^2)$ (typically with $\mu = 0$)
- rewrite the linear regression model probabilistically. For the $i$-th data sample:

$$
p(y_i \mid \mathbf{x}_i, \boldsymbol{\theta}) = p(y_i \mid \mathbf{x}_i, (\mathbf{w}, \sigma_i^2))
= \mathcal{N}(\mu(\mathbf{x}_i), \sigma^2(\mathbf{x}_i))
= \mathcal{N}(\mathbf{w}^\top \mathbf{x}_i, \sigma^2(\mathbf{x}_i))
$$
in our notation:

$$
p(\mathbf{y} \mid X, \boldsymbol{\theta}) = p(\mathbf{y} \mid X, (\boldsymbol{\beta}, \sigma^2))
= \mathcal{N}(\tilde{X}\boldsymbol{\beta}, \sigma^2)
$$

written out:

$$
p\left(
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}
\;\middle|\;
\begin{bmatrix}
1 & x_{11} & \cdots & x_{1D} \\
1 & x_{21} & \cdots & x_{2D} \\
\vdots & \vdots & & \vdots \\
1 & x_{N1} & \cdots & x_{ND}
\end{bmatrix},
\left(
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_D \end{bmatrix},
\sigma^2
\right)
\right)
= \mathcal{N}\!\left(\tilde{X}\boldsymbol{\beta},\ \sigma^2\right)
$$
- for a basis function expansion $\phi(\mathbf{x})$:

$$
p(y \mid \mathbf{x}, \boldsymbol{\theta}) = p(y \mid \mathbf{x}, (\mathbf{w}, \sigma^2, \phi))
= \mathcal{N}(\mathbf{w}^\top \phi(\mathbf{x}), \sigma^2(\mathbf{x}))
$$

- e.g. the polynomial basis $\phi(x) = [1, x, x^2, \cdots, x^d]$
- applied to the design matrix: $\phi(\tilde{\mathbf{x}}_n) \overset{?}{=} [1, \tilde{\mathbf{x}}_n, \tilde{\mathbf{x}}_n^2, \cdots, \tilde{\mathbf{x}}_n^d]$ (element-wise powers, row by row; sketch below)
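A small sketch of the 1-D polynomial basis expansion (the helper name `poly_design` is illustrative): it maps a column of scalar inputs to the $N \times (d+1)$ expanded design matrix.

```python
import numpy as np

# Polynomial basis expansion phi(x) = [1, x, x^2, ..., x^d] applied row-wise:
# a column of scalar inputs becomes an N x (d+1) design matrix.
def poly_design(x, d):
    x = np.asarray(x).reshape(-1, 1)    # column vector, shape (N, 1)
    powers = np.arange(d + 1)           # exponents [0, 1, ..., d]
    return x ** powers                  # broadcasting -> shape (N, d+1)

x = np.array([0.0, 1.0, 2.0])
print(poly_design(x, d=3))
# [[1. 0. 0. 0.]
#  [1. 1. 1. 1.]
#  [1. 2. 4. 8.]]
```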
cost function
is there an automatic way to define the loss function? (under the Gaussian noise model above, maximizing the likelihood corresponds to minimizing the squared error)
Two desirable properties of cost functions:
1. Positive and negative errors should be penalized equally.
2. "Large" mistakes and "very large" mistakes should be penalized almost equally.

Statistical vs. computational (convexity) tradeoff:
cost function:

$$
\mathcal{L}(\boldsymbol{\beta}) = \mathcal{L}\!\left(\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_D \end{bmatrix}\right)
= \mathcal{L}(\beta_0, \beta_1, \cdots, \beta_D)
$$
- MAE: $\mathrm{MAE} := \frac{1}{N}\sum_{n=1}^{N} \lvert y_n - f(\mathbf{x}_n)_{\text{estimate}} \rvert = \frac{1}{N}\sum_{n=1}^{N} \lvert \epsilon_n \rvert = \frac{1}{N}\left(\lvert\epsilon_1\rvert + \lvert\epsilon_2\rvert + \cdots + \lvert\epsilon_N\rvert\right)$
- MSE: $\mathrm{MSE} := \frac{1}{N}\sum_{n=1}^{N} \left(y_n - f(\mathbf{x}_n)_{\text{estimate}}\right)^2$
- Huber loss: quadratic for small residuals, linear for large ones (see the sketch after this list)
- Tukey's bisquare loss: defined in terms of gradients
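A hedged numpy sketch of the losses above (MAE, MSE, and Huber with an assumed threshold `delta`; Tukey's bisquare is omitted):

```python
import numpy as np

# Per-residual losses, averaged over the N residuals eps_n = y_n - f(x_n).
def mae(residuals):
    return np.mean(np.abs(residuals))

def mse(residuals):
    return np.mean(residuals ** 2)

def huber(residuals, delta=1.0):
    r = np.abs(residuals)
    quad = 0.5 * r ** 2                      # quadratic near zero
    lin = delta * (r - 0.5 * delta)          # linear in the tails -> robust to large errors
    return np.mean(np.where(r <= delta, quad, lin))

eps = np.array([0.1, -0.5, 3.0])             # made-up residuals
print(mae(eps), mse(eps), huber(eps))
```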
goal: find $\boldsymbol{\beta}^*$ such that $\mathcal{L}(\boldsymbol{\beta})$ reaches its minimum; note that $\boldsymbol{\beta} \in \mathbb{R}^{D+1}$
- an unconstrained optimization problem
- existence of an optimal solution?
- characterization of the optimal solution
- algorithm for computing the optimal solution
- methods: grid search; gradient descent; least squares (a closed-form least-squares sketch follows below)
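For the least-squares method, a minimal sketch of the closed-form solution via the normal equations (assuming the MSE cost and an invertible $\tilde{X}^\top\tilde{X}$; in practice `np.linalg.lstsq` is preferred for stability):

```python
import numpy as np

# Closed-form least squares: beta* solves (X_tilde^T X_tilde) beta = X_tilde^T y.
def normal_equations(X_tilde, y):
    A = X_tilde.T @ X_tilde
    b = X_tilde.T @ y
    return np.linalg.solve(A, b)

# toy data lying exactly on y = 1 + 2x
X_tilde = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
print(normal_equations(X_tilde, y))   # ~ [1., 2.]  (intercept, slope)
```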
GRID SEARCH
- extremely simple
- works for any kind of loss
- but the computational complexity is exponential in the number of parameters (sketch below)
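An illustrative grid-search sketch (grid range and resolution are arbitrary choices): it evaluates the loss at every point of a $(D+1)$-dimensional grid, which is why the cost blows up exponentially.

```python
import numpy as np
from itertools import product

# Evaluate the loss at every grid point for every coefficient and keep the best.
def grid_search(X_tilde, y, loss, grid):
    best_beta, best_val = None, np.inf
    for beta in product(grid, repeat=X_tilde.shape[1]):   # all grid combinations
        beta = np.array(beta)
        val = loss(y - X_tilde @ beta)
        if val < best_val:
            best_beta, best_val = beta, val
    return best_beta, best_val

X_tilde = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
mse = lambda r: np.mean(r ** 2)
print(grid_search(X_tilde, y, mse, grid=np.linspace(-3, 3, 25)))
```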
BATCH GRADIENT DESCENT
$$
\boldsymbol{\beta}^{(k+1)} = \boldsymbol{\beta}^{(k)} - \alpha \nabla\mathcal{L}(\boldsymbol{\beta}^{(k)})
= \begin{bmatrix} \beta_0^{(k)} \\ \beta_1^{(k)} \\ \vdots \\ \beta_D^{(k)} \end{bmatrix}
- \alpha
\begin{bmatrix}
\partial\mathcal{L}(\beta_0^{(k)}, \beta_1^{(k)}, \cdots, \beta_D^{(k)}) / \partial\beta_0 \\
\partial\mathcal{L}(\beta_0^{(k)}, \beta_1^{(k)}, \cdots, \beta_D^{(k)}) / \partial\beta_1 \\
\vdots \\
\partial\mathcal{L}(\beta_0^{(k)}, \beta_1^{(k)}, \cdots, \beta_D^{(k)}) / \partial\beta_D
\end{bmatrix}
$$
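A sketch of batch gradient descent for the MSE cost, whose gradient is $\nabla\mathcal{L}(\boldsymbol{\beta}) = -\frac{2}{N}\tilde{X}^\top(\mathbf{y} - \tilde{X}\boldsymbol{\beta})$; the step size and stopping tolerance below are illustrative choices.

```python
import numpy as np

# Batch gradient descent for L(beta) = (1/N) ||y - X_tilde beta||^2.
def batch_gradient_descent(X_tilde, y, alpha=0.1, tol=1e-6, max_iter=10_000):
    N, Dp1 = X_tilde.shape
    beta = np.zeros(Dp1)
    for _ in range(max_iter):
        grad = -(2.0 / N) * X_tilde.T @ (y - X_tilde @ beta)
        if np.linalg.norm(grad) < tol:        # stopping criterion: gradient ~ 0
            break
        beta = beta - alpha * grad            # beta^(k+1) = beta^(k) - alpha * grad L(beta^(k))
    return beta

X_tilde = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
print(batch_gradient_descent(X_tilde, y))     # converges to ~ [1., 2.]
```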
convergence of the method - analysis of gradient descent
- when is the method guaranteed to converge?
- when $\mathcal{L}(\boldsymbol{\beta})$ is continuously differentiable on the bounded level set $\{\boldsymbol{\beta} \mid \mathcal{L}(\boldsymbol{\beta}) < \mathcal{L}(\boldsymbol{\beta}^{(0)})\}$ -- refer to the [convergence theorem][1]
- the sequence $\{\nabla\mathcal{L}(\boldsymbol{\beta}^{(k)})\}$: convergence order, linear convergence, convergence constant -- [see here][2]
- if the problem is a quadratic optimization problem (QP), the model is: minimize $f(x) := \frac{1}{2}x^\top Q x + c^\top x$, where $Q$ is a symmetric matrix and the level sets are ellipses; in this case the eigenvalues give the radii of the ellipse, and **the convergence constant depends very much on the ratio of the largest to the smallest eigenvalue of the Hessian matrix $H(x)$ at the optimal solution $x^*$** -- **we want the ratio to be small, that is, nearly 1**
- start from the simplest quadratic-form problem: for $Q = \begin{bmatrix} A & B \\ C & D \end{bmatrix}$, the Hessian of $x^\top Q x$ is $\begin{bmatrix} 2A & B+C \\ C+B & 2D \end{bmatrix}$. During gradient descent, $(x, y)$ moves in the direction opposite to $(f_x, f_y)$, scaled by $\alpha$. If the rates of change of the gradient in $x$ and $y$ ($f_{xx}$, $f_{yy}$) are similar, both coordinates approach the minimum at roughly the same time, so the convergence rate is fast. On a quadratic model this means moving along the radius (axis) directions; if the level sets are nearly circular, convergence is fastest. The radii come from the eigenvalues of $H$, which is just $Q$ after a linear transform, with the radii unchanged.
- the second derivative is the Hessian matrix, a beautiful symmetric matrix with many nice properties: its level sets (cross-sections) are ellipses, it is positive definite, eigenvalue = radius of the ellipse, and so on
- other cases are messier to analyze, so the intuition is enough: gradient descent moves along the radius directions associated with the eigenvalues, and the closer the largest and smallest $\lambda$ are, the better. (For a function of two variables there are only two $\lambda$'s, so it is simply their ratio; with more than two variables, view it as several two-variable problems, e.g. $(x,z)$, $(x,y)$, $(y,z)$, and the limiting factor is the pair with the largest max-to-min ratio.)
ratio of largest to smallest eigenvalue = condition number of the matrix
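A toy illustration of this eigenvalue-ratio intuition (all numbers are made up): gradient descent on $f(x) = \frac{1}{2}x^\top Q x$ needs far more iterations when $Q$ is ill-conditioned.

```python
import numpy as np

# Count gradient-descent iterations on f(x) = 1/2 x^T Q x until the gradient is tiny.
def gd_iterations(Q, alpha, tol=1e-8, max_iter=100_000):
    x = np.array([1.0, 1.0])
    for k in range(max_iter):
        grad = Q @ x                          # gradient of 1/2 x^T Q x
        if np.linalg.norm(grad) < tol:
            return k
        x = x - alpha * grad
    return max_iter

Q_good = np.diag([1.0, 2.0])                  # condition number 2
Q_bad = np.diag([1.0, 100.0])                 # condition number 100
for Q in (Q_good, Q_bad):
    lam_max = np.max(np.linalg.eigvalsh(Q))
    print(np.linalg.cond(Q), gd_iterations(Q, alpha=1.0 / lam_max))
```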
Stopping criteria -- optimality conditions -- design of the gradient descent method:
- first derivative of $\mathcal{L}(\boldsymbol{\beta})$ is zero: $\frac{\partial\mathcal{L}(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}} = 0$, i.e.

$$
\nabla\mathcal{L}(\boldsymbol{\beta}) =
\begin{bmatrix}
\partial\mathcal{L}(\boldsymbol{\beta}) / \partial\beta_0 \\
\partial\mathcal{L}(\boldsymbol{\beta}) / \partial\beta_1 \\
\vdots \\
\partial\mathcal{L}(\boldsymbol{\beta}) / \partial\beta_D
\end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}
$$
- second derivative > 0, i.e. the Hessian is positive definite
Optimality conditions are still useful, in that they serve as a stopping criterion when they are satisfied to within a predetermined error tolerance.
tradeoff: faster convergence ⇔ higher computational cost per iteration
- step-size selection: α
- requirement: convergence to a local minimum is guaranteed only when $\alpha < \alpha_{\min}$, where $\alpha_{\min}$ is a fixed constant that depends on the problem
- line-search methods: used to set the step size automatically, e.g. backtracking
- MIT nonlinear programming -- line-search notes; summary of the notes:
- "bisenction algorithm for a line-search of a convex function" seek to solve:
- α¯¯:=argminαf(x¯+αd¯)
- x_bar : current iterate, d_bar: curretn direction generate by an algorithm that seeks to minimize f(x) such as a descent diection of f(x) at x = x_bar
- => let h(α)=f(x¯+αd¯)
- goal: find α0 such that h(α) reach minimum
- first derivative: h′(α)=0
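A sketch of this bisection line search under the stated convexity assumption (the bracket-growing step assumes $h'$ eventually becomes non-negative):

```python
import numpy as np

# Bisection line search: for convex f and descent direction d_bar,
# h(alpha) = f(x_bar + alpha * d_bar) is convex in alpha, so we bisect on the
# sign of h'(alpha) = grad_f(x_bar + alpha * d_bar)^T d_bar until h'(alpha) ~ 0.
def bisection_line_search(grad_f, x_bar, d_bar, alpha_hi=1.0, tol=1e-8):
    h_prime = lambda a: grad_f(x_bar + a * d_bar) @ d_bar
    while h_prime(alpha_hi) < 0:              # grow the bracket until h' >= 0
        alpha_hi *= 2.0
    alpha_lo = 0.0
    while alpha_hi - alpha_lo > tol:
        mid = 0.5 * (alpha_lo + alpha_hi)
        if h_prime(mid) < 0:
            alpha_lo = mid                    # minimizer lies to the right
        else:
            alpha_hi = mid                    # minimizer lies to the left
    return 0.5 * (alpha_lo + alpha_hi)

# toy usage on f(x) = 1/2 ||x||^2 with the steepest-descent direction
grad_f = lambda x: x
x_bar = np.array([2.0, -1.0])
d_bar = -grad_f(x_bar)
print(bisection_line_search(grad_f, x_bar, d_bar))   # ~ 1.0 (exact minimizer here)
```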
convexity
Outliers - Robust statistics:
- outliers may bias the previous summary statistics; this can be addressed by eliminating or downweighting the outlier values in the sample (quality control), or by using statistics that are resistant to the presence of outliers
- resistant != robust: "robust" is used in statistics to refer to insensitivity to the choice of probability model or estimator rather than to the data values
least squares
- least squares estimates for regression models are highly sensitive to (not robust against) outliers.
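A small made-up demonstration of that sensitivity: fitting the same line by least squares with and without a single corrupted observation shifts the estimate noticeably.

```python
import numpy as np

# Fit y ~ beta_0 + beta_1 x by least squares on clean data, then again after
# corrupting one y value, and compare the estimates.
x = np.arange(10, dtype=float)
y = 1.0 + 2.0 * x                              # exact line y = 1 + 2x
X_tilde = np.column_stack([np.ones_like(x), x])

beta_clean, *_ = np.linalg.lstsq(X_tilde, y, rcond=None)

y_outlier = y.copy()
y_outlier[-1] += 50.0                          # one gross outlier
beta_outlier, *_ = np.linalg.lstsq(X_tilde, y_outlier, rcond=None)

print(beta_clean)     # ~ [1., 2.]
print(beta_outlier)   # noticeably shifted by the single outlier
```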