@lyc102 · 2018-01-26

Stochastic Gradient Descent (SGD) Methods

machine_learning


Problems and Algorithms

Consider the minimization problem

$$\min_{x\in\mathbb{R}^n} f(x) := \frac{1}{N}\sum_{i=1}^{N} f_i(x),$$

where each $f_i:\mathbb{R}^n\to\mathbb{R}$ is the loss associated with the $i$-th data sample.

SGD methods:

$$x_{k+1} = x_k - \alpha_k\,\frac{1}{|S_k|}\sum_{i\in S_k}\nabla f_i(x_k),$$

where $S_k\subset\{1,\dots,N\}$ is a sampled index set and $\frac{1}{|S_k|}\sum_{i\in S_k}\nabla f_i(x_k)$ is the average of the gradients over this sample set.
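As a concrete illustration, a minimal mini-batch SGD loop might look as follows. The function `grad_fi`, the batch size, and the fixed step size `alpha` are illustrative choices, not part of the original notes.

```python
import numpy as np

def sgd(grad_fi, x0, N, alpha=0.01, batch_size=32, n_iters=1000, seed=0):
    """Minimal mini-batch SGD sketch for f(x) = (1/N) * sum_i f_i(x).

    grad_fi(x, idx) should return the average gradient of f_i over the
    index array `idx`, evaluated at x. Hyperparameters are illustrative.
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(n_iters):
        idx = rng.choice(N, size=batch_size, replace=False)  # sampled index set S_k
        x -= alpha * grad_fi(x, idx)                          # x_{k+1} = x_k - alpha * averaged gradient
    return x
```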

Convergence Analysis of SGD

Theorem. If $f$ is 1) first-order Lipschitz continuous (i.e., $\nabla f$ is $L$-Lipschitz) and 2) strongly convex, then with a fixed step size $\alpha_k \equiv \alpha$ small enough, there exists $\rho\in(0,1)$ such that

$$\mathbb{E}\bigl[f(x_k) - f(x^*)\bigr] \le \rho^{\,k}\bigl(f(x_0) - f(x^*)\bigr) + C,$$

where

$$C = \mathcal{O}\!\left(\frac{\alpha\,\sigma^2}{s}\right),$$

with $s = |S_k|/N$ the sample ratio and $\sigma^2$ the sampled variance of the stochastic gradient.
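A quick numerical check of this behavior (not from the original notes): for the one-dimensional quadratic finite sum $f(x) = \frac{1}{2N}\sum_i (x - a_i)^2$, fixed-step SGD decreases geometrically and then stalls at a noise floor whose size shrinks with the step size and grows as the batch size shrinks. The data and hyperparameters below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=1000)            # synthetic data; the minimizer is x* = mean(a)
x_star = a.mean()

def run_sgd(alpha, batch_size, n_iters=2000):
    x = 5.0                           # arbitrary starting point
    for _ in range(n_iters):
        idx = rng.choice(a.size, size=batch_size, replace=False)
        grad = x - a[idx].mean()      # gradient of (1/(2|S|)) * sum_{i in S} (x - a_i)^2
        x -= alpha * grad
    return abs(x - x_star)

for alpha in (0.5, 0.05):
    for batch in (1, 100):
        print(f"alpha={alpha:<5} batch={batch:<4} final error ~ {run_sgd(alpha, batch):.2e}")
```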

Improvements

Pros and Cons

Pros: fast and easy to implement.
Cons:
- Hard to choose the learning rate $\alpha$
- The learning rate may need to be a vector (one entry per coordinate) rather than a scalar
- Can get stuck at a local minimum or a saddle point

Momentum
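The classical heavy-ball momentum update, in its standard form (the notation $\alpha$, $\beta$ below is the usual one and may differ from the original notes):

$$v_{k+1} = \beta v_k - \alpha\,\nabla f(x_k), \qquad x_{k+1} = x_k + v_{k+1},$$

with momentum coefficient $0\le\beta<1$ (commonly $\beta\approx 0.9$).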

NAG (Nesterov Accelerated Gradient)
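A standard form of the Nesterov update, which evaluates the gradient at the look-ahead point rather than at the current iterate (same notation as above, added here for reference):

$$v_{k+1} = \beta v_k - \alpha\,\nabla f(x_k + \beta v_k), \qquad x_{k+1} = x_k + v_{k+1}.$$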

Adaptive Learning Rate

AdaGrad
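AdaGrad in its usual per-coordinate form (standard formulation, not necessarily the exact notation of the original notes), where $g_k$ is the (stochastic) gradient at step $k$, $\odot$ is the element-wise product, and $\epsilon$ is a small constant for numerical stability:

$$G_k = G_{k-1} + g_k\odot g_k, \qquad x_{k+1} = x_k - \frac{\alpha}{\sqrt{G_k}+\epsilon}\odot g_k.$$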

RMSProp
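RMSProp replaces AdaGrad's running sum by an exponential moving average with decay $\rho$ (commonly $\rho\approx 0.9$); standard form shown for reference:

$$E[g^2]_k = \rho\,E[g^2]_{k-1} + (1-\rho)\,g_k\odot g_k, \qquad x_{k+1} = x_k - \frac{\alpha}{\sqrt{E[g^2]_k}+\epsilon}\odot g_k.$$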

AdaDelta
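AdaDelta additionally keeps a moving average of past squared updates, which removes the global learning rate (standard form, for reference):

$$\Delta x_k = -\frac{\sqrt{E[\Delta x^2]_{k-1}+\epsilon}}{\sqrt{E[g^2]_k+\epsilon}}\odot g_k, \qquad x_{k+1} = x_k + \Delta x_k, \qquad E[\Delta x^2]_k = \rho\,E[\Delta x^2]_{k-1} + (1-\rho)\,\Delta x_k\odot\Delta x_k.$$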

Adam
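Adam combines a first-moment (momentum-style) and a second-moment (RMSProp-style) estimate, with bias correction. The standard form from Kingma & Ba (defaults $\beta_1=0.9$, $\beta_2=0.999$) is:

$$m_k = \beta_1 m_{k-1} + (1-\beta_1)\,g_k, \qquad v_k = \beta_2 v_{k-1} + (1-\beta_2)\,g_k\odot g_k,$$

$$\hat m_k = \frac{m_k}{1-\beta_1^k}, \qquad \hat v_k = \frac{v_k}{1-\beta_2^k}, \qquad x_{k+1} = x_k - \frac{\alpha}{\sqrt{\hat v_k}+\epsilon}\odot \hat m_k.$$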

Convergence Analysis of Adam

Define the regret

$$R(T) = \sum_{t=1}^{T}\bigl[f_t(x_t) - f_t(x^*)\bigr], \qquad x^* = \arg\min_{x}\sum_{t=1}^{T} f_t(x).$$

With certain assumptions (convex $f_t$, bounded gradients, and a bounded feasible set), Adam achieves

$$R(T) = \mathcal{O}(\sqrt{T}), \qquad \text{so that} \qquad \frac{R(T)}{T} \to 0 \ \text{as}\ T\to\infty.$$
Exercise

Consider the least squares problem

$$\min_{x\in\mathbb{R}^n} \frac{1}{2}\|Ax - b\|^2 = \min_{x\in\mathbb{R}^n} \frac{1}{2}\sum_{i=1}^{m}\bigl(a_i^{\mathsf T}x - b_i\bigr)^2,$$

where $A\in\mathbb{R}^{m\times n}$ is a tall matrix, i.e., $m \gg n$, and $a_i^{\mathsf T}$ denotes the $i$-th row of $A$.

Apply SGD to solve the least squares problem and present a convergence proof.
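A minimal sketch of the numerical part of this exercise (the data below is synthetic, and the step size and batch size are arbitrary illustrative choices; the convergence proof is left to the reader):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5000, 20                       # tall matrix: m >> n
A = rng.normal(size=(m, n))           # synthetic data, illustrative only
x_true = rng.normal(size=n)
b = A @ x_true + 0.01 * rng.normal(size=m)

x = np.zeros(n)
alpha, batch_size = 0.01, 50          # illustrative hyperparameters
for k in range(5000):
    idx = rng.choice(m, size=batch_size, replace=False)
    # mini-batch gradient: unbiased estimate of (1/m) * A.T @ (A x - b),
    # i.e. the gradient of the rescaled objective (1/(2m)) ||Ax - b||^2
    grad = A[idx].T @ (A[idx] @ x - b[idx]) / batch_size
    x -= alpha * grad

x_ls = np.linalg.lstsq(A, b, rcond=None)[0]   # reference least-squares solution
print("distance to least-squares solution:", np.linalg.norm(x - x_ls))
```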
