Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$.
To make a prediction on a new $x$, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$.
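As a concrete illustration, here is a minimal Octave sketch of one-vs-all prediction; the function name and the variable `all_theta` (one row of learned parameters per class) are my own, not from the course code.

```octave
% Predict the class for each example in X using one-vs-all classifiers.
% all_theta: K x (n+1) matrix, one row of logistic regression parameters per class.
function p = predictOneVsAll(all_theta, X)
  m = size(X, 1);
  X = [ones(m, 1) X];                    % prepend the bias (intercept) term
  h = 1 ./ (1 + exp(-X * all_theta'));   % h(i, k) = estimated P(y = k | x^(i))
  [~, p] = max(h, [], 2);                % pick the class with the highest probability
end
```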
Overfitting
An instance of overfitting: a hypothesis that fits the training set well but has high variance and is unlikely to generalize well to new examples.
Reduce the number of features:
Manually select which features to keep.
Use a model selection algorithm (studied later in the course).
Regularization
Keep all the features, but reduce the magnitude of parameters $\theta_j$.
Regularization works well when we have a lot of slightly useful features.
Regularization
Linear Regression
Cost Function
$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right]$$
The $\lambda$, or lambda, is the regularization parameter.
Gradient Descent
$$\theta_0 := \theta_0 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$$
$$\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right] \qquad j \in \{1, 2, \dots, n\}$$
with some manipulation:
$$\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$
Normal Equation
$$\theta = \left(X^TX + \lambda L\right)^{-1}X^Ty$$
where $L$ is the $(n+1)\times(n+1)$ identity matrix with the top-left entry zeroed out, so $\theta_0$ is not regularized. PS: I don't know how to derive it yet! A pity!
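Even without the derivation, the formula is easy to compute; a minimal Octave sketch (function name is illustrative):

```octave
% Solve for theta in closed form with regularization.
% X: m x (n+1) design matrix (first column all ones), y: m x 1 targets.
function theta = normalEqnReg(X, y, lambda)
  n = size(X, 2);
  L = eye(n);
  L(1, 1) = 0;                              % do not regularize the bias term theta_0
  theta = (X' * X + lambda * L) \ (X' * y); % backslash solve, no explicit inverse
end
```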
Logistic Regression
Cost Function
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$
Gradient Descent
$$\theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j\right] \qquad j \in \{1, 2, \dots, n\}$$
The update is the same as linear regression in form, but the hypothesis $h_\theta(x) = \frac{1}{1 + e^{-\theta^Tx}}$ is different.
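A hedged Octave sketch of the regularized cost and gradient (the function and variable names are mine, though the course exercises use a similar shape):

```octave
% Regularized logistic regression cost J and gradient.
% theta: (n+1) x 1, X: m x (n+1) with bias column, y: m x 1 labels in {0, 1}.
function [J, grad] = lrCostFunctionReg(theta, X, y, lambda)
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                       % sigmoid hypothesis
  J = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h)) ...
      + (lambda / (2 * m)) * sum(theta(2:end) .^ 2);    % theta_0 is not penalized
  grad = (1 / m) * (X' * (h - y));
  grad(2:end) += (lambda / m) * theta(2:end);           % regularize j >= 1 only
end
```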
4. Neural Networks
Model Representation
In neural networks, we use the same logistic function as in classification, $\frac{1}{1 + e^{-\theta^Tx}}$, yet we sometimes call it a sigmoid (logistic) activation function. In this situation, our "theta" parameters are sometimes called "weights".
Visually, a simplistic representation looks like:
$$\begin{bmatrix}x_0 \\ x_1 \\ x_2\end{bmatrix} \rightarrow \left[a_1^{(2)}\right] \rightarrow h_\theta(x)$$
If network has $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j+1$, then $\Theta^{(j)}$ will be of dimension $s_{j+1} \times (s_j + 1)$.
Setting $x = a^{(1)}$, we can get the equation:
$$z^{(j+1)} = \Theta^{(j)}a^{(j)}, \qquad a^{(j+1)} = g\left(z^{(j+1)}\right)$$
Cost Function
$L$ = total number of layers in the network
$s_l$ = number of units (not counting bias unit) in layer $l$
$K$ = number of output units/classes
$$J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[y_k^{(i)}\log\left(h_\Theta(x^{(i)})\right)_k + \left(1 - y_k^{(i)}\right)\log\left(1 - \left(h_\Theta(x^{(i)})\right)_k\right)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\Theta_{j,i}^{(l)}\right)^2$$
Backpropagation Algorithm
Given training set $\{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$
Set $\Delta_{i,j}^{(l)} := 0$ for all $(l, i, j)$
For training example $t = 1$ to $m$:
Set $a^{(1)} := x^{(t)}$
Perform forward propagation to compute $a^{(l)}$ for $l = 2, 3, \dots, L$
Using $y^{(t)}$, compute $\delta^{(L)} = a^{(L)} - y^{(t)}$
Compute $\delta^{(L-1)}, \delta^{(L-2)}, \dots, \delta^{(2)}$ using $\delta^{(l)} = \left((\Theta^{(l)})^T\delta^{(l+1)}\right) .*\ a^{(l)} .*\ (1 - a^{(l)})$
$\Delta_{i,j}^{(l)} := \Delta_{i,j}^{(l)} + a_j^{(l)}\delta_i^{(l+1)}$, or with vectorization, $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}\left(a^{(l)}\right)^T$
Hence we update our new $\Delta$ matrix.
Thus we get
$$D_{i,j}^{(l)} := \frac{1}{m}\left(\Delta_{i,j}^{(l)} + \lambda\Theta_{i,j}^{(l)}\right) \text{ if } j \neq 0, \qquad D_{i,j}^{(l)} := \frac{1}{m}\Delta_{i,j}^{(l)} \text{ if } j = 0$$
where $\frac{\partial}{\partial\Theta_{i,j}^{(l)}}J(\Theta) = D_{i,j}^{(l)}$.
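To make the loop concrete, here is a hedged Octave sketch for a 3-layer network; `Theta1`, `Theta2`, `Y` (labels as one-hot columns), `m`, and `lambda` are assumed to already exist and are illustrative names, not the course's code.

```octave
sigmoid = @(z) 1 ./ (1 + exp(-z));     % logistic activation
Delta1 = zeros(size(Theta1));          % gradient accumulators
Delta2 = zeros(size(Theta2));
for t = 1:m
  a1 = [1; X(t, :)'];                  % input with bias unit
  z2 = Theta1 * a1;  a2 = [1; sigmoid(z2)];
  z3 = Theta2 * a2;  a3 = sigmoid(z3); % forward propagation
  d3 = a3 - Y(:, t);                   % output-layer error
  d2 = (Theta2(:, 2:end)' * d3) .* sigmoid(z2) .* (1 - sigmoid(z2));
  Delta2 += d3 * a2';                  % accumulate delta^(l+1) * a^(l)'
  Delta1 += d2 * a1';
end
D1 = Delta1 / m;  D1(:, 2:end) += (lambda / m) * Theta1(:, 2:end);
D2 = Delta2 / m;  D2(:, 2:end) += (lambda / m) * Theta2(:, 2:end);
```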
Backpropagation in Practice
Unrolling Parameters
For example, if the dimensions of Theta1 are 10x11, Theta2 are 10x11 and Theta3 are 1x11:

```octave
% unroll the matrices into one long vector
thetaVector = [ Theta1(:); Theta2(:); Theta3(:) ];
deltaVector = [ D1(:); D2(:); D3(:) ];

% back: recover the original matrices
Theta1 = reshape(thetaVector(1:110), 10, 11);
Theta2 = reshape(thetaVector(111:220), 10, 11);
Theta3 = reshape(thetaVector(221:231), 1, 11);
```
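The unrolled vector can then be handed to an advanced optimizer; a hedged sketch, assuming a `costFunction` that returns the cost and the unrolled gradient for a given unrolled `thetaVector` (the handle's extra arguments are illustrative):

```octave
% Minimize the cost with fminunc, passing and receiving unrolled vectors.
options = optimset('GradObj', 'on', 'MaxIter', 100);
[optTheta, cost] = fminunc(@(t) costFunction(t, X, y, lambda), ...
                           initialTheta, options);
% Reshape optTheta back into Theta1, Theta2, Theta3 as above before use.
```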
Gradient Checking
The approximate value of the gradient:
$$\frac{\partial}{\partial\Theta_j}J(\Theta) \approx \frac{J(\Theta_1, \dots, \Theta_j + \epsilon, \dots, \Theta_n) - J(\Theta_1, \dots, \Theta_j - \epsilon, \dots, \Theta_n)}{2\epsilon}$$
A small value such as $\epsilon = 10^{-4}$ works well. Make sure to check the gradient only once: after verifying that backpropagation is correct, turn the check off, because the approximation is very slow to compute.
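In code, the check looks roughly like this (a sketch; `J` is assumed to be a cost handle over the unrolled parameter vector `theta`):

```octave
epsilon = 1e-4;
n = length(theta);
gradApprox = zeros(n, 1);
for i = 1:n
  thetaPlus  = theta;  thetaPlus(i)  += epsilon;  % perturb one component up
  thetaMinus = theta;  thetaMinus(i) -= epsilon;  % and down
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * epsilon);
end
% Compare gradApprox with the backpropagation gradient, then disable this
% check: it is far too slow to run on every iteration.
```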
5. Advice for Applying Machine Learning
Evaluating a Hypothesis
Once we have done some troubleshooting for errors in our predictions by:
Getting more training examples
Trying smaller sets of features
Trying additional features
Trying polynomial features
Increasing or decreasing $\lambda$
we can evaluate our new hypothesis on a held-out test set.
The test set error
For linear regression:
$$J_{test}(\theta) = \frac{1}{2m_{test}}\sum_{i=1}^{m_{test}}\left(h_\theta(x_{test}^{(i)}) - y_{test}^{(i)}\right)^2$$
For classification ~ Misclassification error (aka 0/1 misclassification error):
$$err(h_\theta(x), y) = \begin{cases}1 & \text{if } h_\theta(x) \geq 0.5 \text{ and } y = 0, \text{ or } h_\theta(x) < 0.5 \text{ and } y = 1 \\ 0 & \text{otherwise}\end{cases}$$
The average test error for the test set is:
$$\text{Test Error} = \frac{1}{m_{test}}\sum_{i=1}^{m_{test}} err\left(h_\theta(x_{test}^{(i)}), y_{test}^{(i)}\right)$$
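For a logistic classifier, the average 0/1 test error is two lines of Octave (a sketch with illustrative variable names):

```octave
% 0/1 misclassification error on the test set.
predictions = (1 ./ (1 + exp(-Xtest * theta))) >= 0.5;  % predicted labels
testError = mean(predictions ~= ytest);                 % fraction misclassified
```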
Model Selection and Train/Validation/Test Sets
One way to break down our dataset into the three sets is:
Training set: 60%
Cross validation set: 20%
Test set: 20%
We can now calculate three separate error values for the three different sets using the following method:
Optimize the parameters in $\Theta$ using the training set for each polynomial degree.
Find the polynomial degree $d$ with the least error using the cross validation set.
Estimate the generalization error using the test set with $J_{test}(\Theta^{(d)})$ ($d$ = the degree of the polynomial with the lowest cross validation error);
This way, the degree of the polynomial d has not been trained using the test set.
Bias vs. Variance
Diagnosing Bias vs. Variance
High bias (underfitting): both $J_{train}(\Theta)$ and $J_{CV}(\Theta)$ will be high. Also, $J_{CV}(\Theta) \approx J_{train}(\Theta)$.
High variance (overfitting): $J_{train}(\Theta)$ will be low and $J_{CV}(\Theta)$ will be much greater than $J_{train}(\Theta)$.
Regularization and Bias/Variance
Create a list of $\lambda$s (e.g. $\lambda \in \{0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24\}$);
Create a set of models with different degrees or any other variants.
Iterate through the $\lambda$s, and for each $\lambda$ go through all the models to learn some $\Theta$.
Compute the cross validation error $J_{CV}(\Theta)$ using the learned $\Theta$ (computed with $\lambda$) without regularization, i.e. $\lambda = 0$.
Select the best combo $\Theta$ and $\lambda$ that produces the lowest error on the cross validation set.
Using the best combo $\Theta$ and $\lambda$, apply it on $J_{test}(\Theta)$ to see if it has a good generalization of the problem (a sketch of this loop follows below).
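A hedged Octave sketch of the selection loop; `trainLinearReg` and `linearRegCost` are hypothetical helpers (train with regularization, evaluate the unregularized cost), not course functions:

```octave
lambdas = [0 0.01 0.02 0.04 0.08 0.16 0.32 0.64 1.28 2.56 5.12 10.24];
cvErrors = zeros(size(lambdas));
for i = 1:length(lambdas)
  theta = trainLinearReg(X, y, lambdas(i));          % fit WITH regularization
  cvErrors(i) = linearRegCost(Xval, yval, theta, 0); % evaluate WITHOUT it
end
[~, best] = min(cvErrors);
bestLambda = lambdas(best);
% Finally, report the generalization error J_test with the chosen theta.
```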
Learning Curves
Deciding What to Do Next Revisited
Getting more training examples: Fixes high variance
Trying smaller sets of features: Fixes high variance
Adding features: Fixes high bias
Adding polynomial features: Fixes high bias
Decreasing λ: Fixes high bias
Increasing λ: Fixes high variance
System Design
Prioritizing What to Work On
Collect lots of data (for example, the "honeypot" project, though it doesn't always work)
Develop sophisticated features (for example: using email header data in spam emails)
Develop algorithms to process your input in different ways (recognizing misspellings in spam).
It is difficult to tell which of the options will be most helpful.
Error Analysis
Start with a simple algorithm, implement it quickly, and test it early on your cross validation data.
Plot learning curves to decide if more data, more features, etc. are likely to help.
Manually examine the errors on examples in the cross validation set and try to spot a trend where most of the errors were made.
Handling Skewed Classes
Error Metrics
For a classifier on a skewed class, evaluate precision and recall on the confusion matrix:

| | Actual Class: 1 | Actual Class: 0 |
| --- | --- | --- |
| Predicted Class: 1 | True Positive | False Positive |
| Predicted Class: 0 | False Negative | True Negative |

$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}, \qquad \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$
Trade Off of Precision and Recall
F$_1$ Score
$$F_1 = 2\,\frac{PR}{P + R}$$
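Given 0/1 vectors of predictions and ground truth, all three metrics are a few lines of Octave (a sketch; variable names are mine):

```octave
tp = sum((pred == 1) & (actual == 1));   % true positives
fp = sum((pred == 1) & (actual == 0));   % false positives
fn = sum((pred == 0) & (actual == 1));   % false negatives
P  = tp / (tp + fp);                     % precision
R  = tp / (tp + fn);                     % recall
F1 = 2 * P * R / (P + R);                % harmonic mean of P and R
```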
Using Large Data Sets
Rationale: Useful test: Given the input $x$, can a human expert confidently predict $y$?
Use a learning algorithm with many parameters (low bias)
Use a very large training set (unlikely to overfit; low variance)
6. Support Vector Machines
Large Margin Classification
SVM Hypothesis
$$\min_\theta\; C\sum_{i=1}^{m}\left[y^{(i)}\,\text{cost}_1(\theta^Tx^{(i)}) + \left(1 - y^{(i)}\right)\text{cost}_0(\theta^Tx^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$$
Large Margin Intuition
SVM Decision Boundary
The constraints require $\theta^Tx^{(i)} = p^{(i)}\,\|\theta\|$ to be $\geq 1$ (for $y^{(i)} = 1$) or $\leq -1$ (for $y^{(i)} = 0$). So that $\|\theta\|$ can be small, the projections $p^{(i)}$ of the examples onto $\theta$ are going to be large; that is a large margin.
Kernels
Kernels and Similarity
We define the similarity between $x$ and landmark $l^{(i)}$ as:
$$f_i = \text{similarity}(x, l^{(i)}) = \exp\left(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}\right)$$
If $x \approx l^{(i)}$, then $f_i \approx 1$;
If $x$ is far from $l^{(i)}$, then $f_i \approx 0$.
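As a sketch, the similarity feature in Octave (the function name is illustrative):

```octave
% Gaussian (RBF) kernel: similarity between example x and landmark l.
function f = gaussianKernel(x, l, sigma)
  f = exp(-sum((x - l) .^ 2) / (2 * sigma ^ 2));
end
```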
SVM Parameters
$C$ (acting like $\frac{1}{\lambda}$): large $C$ gives lower bias but higher variance; small $C$ gives higher bias but lower variance. Large $\sigma^2$ makes the features $f_i$ vary more smoothly: higher bias, lower variance (and vice versa for small $\sigma^2$).
SVMs in Practice
7. Clustering
K-Means Algorithm
Repeat two steps: (1) cluster assignment: assign each example to its closest centroid, $c^{(i)} := \arg\min_k \|x^{(i)} - \mu_k\|^2$; (2) move centroids: set each $\mu_k$ to the mean of the points assigned to it.
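A hedged Octave sketch of the two alternating steps (`mu` is K x n, `X` is m x n; all names are mine):

```octave
c = zeros(m, 1);                          % cluster assignment per example
for iter = 1:maxIters
  % Cluster assignment step: index of the closest centroid for each example.
  for i = 1:m
    [~, c(i)] = min(sum((mu - X(i, :)) .^ 2, 2));
  end
  % Move centroid step: mean of the points assigned to each centroid.
  % (In practice, reinitialize or drop a centroid with no assigned points.)
  for k = 1:K
    mu(k, :) = mean(X(c == k, :), 1);
  end
end
```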
K-Means Optimization Objective
$$J(c^{(1)}, \dots, c^{(m)}, \mu_1, \dots, \mu_K) = \frac{1}{m}\sum_{i=1}^{m}\left\|x^{(i)} - \mu_{c^{(i)}}\right\|^2$$
Random Initialization
Should have $K < m$.
Randomly pick $K$ training examples.
Set $\mu_1, \dots, \mu_K$ equal to these $K$ examples.
Run random initialization multiple times and keep the clustering that gives the lowest cost $J$.
Principal Component Analysis
PCA is unsupervised learning, while linear regression is supervised learning.
PCA Algorithm
After mean normalization (and optionally feature scaling), compute the covariance matrix $\Sigma = \frac{1}{m}\sum_{i=1}^{m}(x^{(i)})(x^{(i)})^T$, take its singular value decomposition, and project each example onto the first $k$ eigenvectors: $z = U_{reduce}^T\,x$.
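In Octave the whole algorithm is a few lines (a sketch; `X` is assumed to be the m x n mean-normalized data matrix):

```octave
Sigma = (1 / m) * (X' * X);     % covariance matrix (n x n)
[U, S, V] = svd(Sigma);         % columns of U are the principal components
Ureduce = U(:, 1:k);            % keep the first k components
Z = X * Ureduce;                % projected data: row i is (Ureduce' * x^(i))'
Xapprox = Z * Ureduce';         % reconstruction from the compressed z's
```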
Reconstruction from Compressed Representation
$$x_{approx} = U_{reduce}\,z$$
Dimensionality Reduction
Choosing the Number of Principal Components
Choose the smallest $k$ such that
$$\frac{\frac{1}{m}\sum_{i=1}^{m}\left\|x^{(i)} - x_{approx}^{(i)}\right\|^2}{\frac{1}{m}\sum_{i=1}^{m}\left\|x^{(i)}\right\|^2} \leq 0.01$$
i.e. 99% of the variance is retained. Using the matrix $S$ from the SVD, this is equivalent to checking $\frac{\sum_{i=1}^{k}S_{ii}}{\sum_{i=1}^{n}S_{ii}} \geq 0.99$.
Application of PCA
Bad use of PCA: To prevent overfitting
This might work OK, but isn't a good way to address overfitting.
Use regularization instead. PCA throws away information without ever looking at the labels $y$, so it may discard exactly what the labels need.
8. Anomaly Detection
Anomaly Detection Algorithm
Fit parameters $\mu_j = \frac{1}{m}\sum_{i=1}^{m}x_j^{(i)}$ and $\sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_j^{(i)} - \mu_j\right)^2$ for each feature, model the density as $p(x) = \prod_{j=1}^{n}p(x_j; \mu_j, \sigma_j^2)$, and flag an anomaly if $p(x) < \epsilon$.
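A hedged Octave sketch of fitting the per-feature Gaussians and scoring a new example (variable names are mine):

```octave
mu = mean(X);                              % 1 x n vector of feature means
sigma2 = mean((X - mu) .^ 2);              % 1 x n vector of feature variances
% Density of a new example x (a 1 x n row vector):
p = prod(exp(-((x - mu) .^ 2) ./ (2 * sigma2)) ./ sqrt(2 * pi * sigma2));
isAnomaly = p < epsilon;                   % flag an anomaly when p(x) < epsilon
```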
Evaluation
We can use the same metrics and methodology as the supervised algorithms before (e.g. precision/recall or the F$_1$ score on a labeled cross validation set).
Anomaly Detection vs. Supervised Learning
Anomaly Detection with the Multivariate Gaussian
$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu)\right)$$
Relation to Original Model
Diagonal Covariance Matrix
The original model is the special case where the covariance matrix $\Sigma$ is diagonal, so the contours of $p(x)$ are axis-aligned.
Original Model vs. Multivariate Gaussian
Recommender Systems
Content-based Recommender Systems
Collaborative Filtering
Collaborative Filtering Algorithm
Minimize over the features $x$ and the parameters $\theta$ simultaneously:
$$J(x, \theta) = \frac{1}{2}\sum_{(i,j):r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^{n}\left(x_k^{(i)}\right)^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^{n}\left(\theta_k^{(j)}\right)^2$$
Mean normalization: subtract each movie's mean rating before training and add it back when predicting, so a user with no ratings is predicted each movie's mean rather than an uninformative zero.
9. Large Scale Machine Learning
Gradient Descent with Large Datasets
Batch Gradient Descent: use all $m$ training examples in each iteration.
Stochastic Gradient Descent: randomly shuffle the dataset, then repeatedly sweep through it, updating $\theta$ with one example at a time: $\theta_j := \theta_j - \alpha\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$.
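As a sketch, stochastic gradient descent for linear regression in Octave (all names are illustrative):

```octave
for epoch = 1:numEpochs
  idx = randperm(m);                        % randomly shuffle the dataset
  for i = idx
    h = X(i, :) * theta;                    % hypothesis on a single example
    theta -= alpha * (h - y(i)) * X(i, :)'; % one update per example
  end
end
```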
Learning Rate: $\alpha$ can be slowly decreased over time (e.g. $\alpha = \frac{\text{const}_1}{\text{iterationNumber} + \text{const}_2}$) to help $\theta$ converge, but this is not usually recommended, cuz adjusting 2 extra constants is a headache!
Mini-Batch Gradient Descent: use $b$ examples (typically $b = 2$ to $100$) in each iteration, which lets the update be vectorized.
Comparison
Checking for Convergence
Every 1000 iterations or so, plot $\text{cost}\left(\theta, (x^{(i)}, y^{(i)})\right)$ averaged over the last 1000 examples processed, and check that it is decreasing.
Online Learning
Its dataset is a continuous stream of examples.
So, as each example arrives, perform one gradient descent update on it, then discard it.
Can adapt to changing user preferences.
Map Reduce and Data Parallelism
Map-reduce and summation over the training set: express batch gradient descent as one big sum, split the training set into subsets, compute each partial sum $\text{temp}_j^{(k)} = \sum_{i \in \text{subset } k}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$ in parallel, then combine on a master node:
$$\theta_j := \theta_j - \alpha\,\frac{1}{m}\left(\text{temp}_j^{(1)} + \text{temp}_j^{(2)} + \dots + \text{temp}_j^{(K)}\right)$$
Multiple machines: the partial sums come from different computers, so network latency is an overhead.
Multiple cores: the same idea within one machine with no network cost; some vectorized linear algebra libraries already parallelize across cores automatically.
10. Application Example: Photo OCR
Photo OCR Pipeline
Image → Text detection → Character segmentation → Character recognition
Sliding Windows
Getting Lots of Data and Artificial Data
Synthesizing data by introducing distortions
Discussion on Getting More Data
Make sure you have a low bias classifier before expending the effort.
How much work would it be to get 10x as much data as we currently have?
Ceiling Analysis: What Part of the Pipeline to Work on Next
For each stage in turn, substitute ground-truth (perfect) output for that stage and measure how much overall system accuracy improves; spend your effort on the stage with the largest potential gain.