@TaoSama 2017-04-18

Machine Learning




0. Symbols

$m$ = the number of training examples, $n$ = the number of features
$x_j^{(i)}$ = the value of feature $j$ in the $i$-th training example

1. Linear Regression

Hypothesis Function

Cost Function


Gradient Descent Algorithm




Simultaneous Update
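For reference, the standard course forms of the hypothesis, the cost, and the gradient descent rule (with all $\theta_j$ updated simultaneously from the old values) are:

$$h_\theta(x) = \theta_0 + \theta_1 x$$

$$J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2$$

$$\text{repeat until convergence:}\quad \theta_j := \theta_j - \alpha\,\frac{\partial}{\partial\theta_j}J(\theta_0, \theta_1) \qquad (j = 0, 1)$$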

2. Multivariate Linear Regression

Hypothesis Function

Gradient Descent for Multiple Variables
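A minimal Octave sketch of the vectorized update $\theta := \theta - \frac{\alpha}{m}X^T(X\theta - y)$; the function and variable names here are illustrative, not from the original notes:

```matlab
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
  % X: m x (n+1) design matrix (first column all ones), y: m x 1 labels
  m = length(y);
  J_history = zeros(num_iters, 1);
  for iter = 1:num_iters
    theta = theta - (alpha / m) * (X' * (X * theta - y));        % simultaneous update of all theta_j
    J_history(iter) = (1 / (2 * m)) * sum((X * theta - y) .^ 2); % track the cost to check convergence
  end
end
```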



Feature Scaling and Mean Normalization
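Each feature is rescaled, typically as $x_j := \frac{x_j - \mu_j}{s_j}$ with mean $\mu_j$ and standard deviation (or range) $s_j$, so that gradient descent converges faster. A small Octave sketch:

```matlab
mu    = mean(X);   % 1 x n row vector of feature means
sigma = std(X);    % 1 x n row vector of feature standard deviations
X_norm = bsxfun(@rdivide, bsxfun(@minus, X, mu), sigma);  % (x_j - mu_j) / s_j for every feature
```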

Learning Rate

Features and Polynomial Regression

Normal Equation

$\theta = (X^TX)^{-1}X^Ty$, where $X$ is the $m \times (n+1)$ design matrix and $y$ is the $m$-dimensional vector of targets

Multiplying by $X^T$ is to make $X^TX$ a square matrix (so that it can be inverted).
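In Octave this is a one-liner; `pinv` keeps it working even when $X^TX$ is non-invertible (redundant features, or $m \le n$):

```matlab
theta = pinv(X' * X) * X' * y;   % normal equation: no iterations, no learning rate to choose
```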

3. Logistic Regression

Hypothesis Function

"Sigmoid Function", also called the "Logistic Function":

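In symbols, with the output read as a probability:

$$h_\theta(x) = g(\theta^Tx), \qquad g(z) = \frac{1}{1 + e^{-z}}, \qquad h_\theta(x) = P(y = 1 \mid x;\, \theta)$$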


Decision Boundary

Cost Function
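The (unregularized) cost, which is convex for logistic regression:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y^{(i)}\log h_\theta(x^{(i)}) + \big(1 - y^{(i)}\big)\log\big(1 - h_\theta(x^{(i)})\big)\Big]$$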

Gradient Descent
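The update rule takes exactly the same form as for linear regression, only with the logistic $h_\theta$:

$$\theta_j := \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)\, x_j^{(i)}$$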








A vectorized implementation is: $\theta := \theta - \frac{\alpha}{m}\, X^T\big(g(X\theta) - \vec{y}\big)$

Deduction:


P.S. The two forms come out equal, differing only by a minus sign.

Advanced Optimization

Use MATLAB/Octave built-in optimization routines, such as fminunc().

```matlab
function [jVal, gradient] = costFunction(theta, X, y)
  jVal = [...code to compute J(theta)...];
  gradient = [...code to compute derivative of J(theta)...];
end

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);
[optTheta, functionVal, exitFlag] = ...
    fminunc(@(theta)(costFunction(theta, X, y)), initialTheta, options);
```

Multiclass Classification: One-vs-all

Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y = i$.
To make a prediction on a new $x$, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$.
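A small Octave sketch of the prediction step, assuming a hypothetical `all_theta` matrix whose $i$-th row holds the parameters of classifier $i$:

```matlab
% X: m x (n+1) design matrix, all_theta: K x (n+1), one row of parameters per class
sigmoid = @(z) 1 ./ (1 + exp(-z));
probs = sigmoid(X * all_theta');   % m x K matrix of per-class probabilities
[~, p] = max(probs, [], 2);        % p(i) is the predicted class for example i
```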






Overfitting

Overfitting (high variance): the hypothesis fits the training data very closely but is unlikely to generalize well to new examples.

Regularization

Linear Regression
Logistic Regression
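Both cost functions gain the penalty term $\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$ (the bias term $\theta_0$ is not regularized):

$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2 + \lambda\sum_{j=1}^{n}\theta_j^2\right] \quad \text{(linear regression)}$$

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log h_\theta(x^{(i)}) + \big(1-y^{(i)}\big)\log\big(1-h_\theta(x^{(i)})\big)\Big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2 \quad \text{(logistic regression)}$$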

4. Neural Networks

Model Representation

Cost Function
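For a network with $L$ layers, $s_l$ units in layer $l$, and $K$ output units, the regularized cost generalizes the logistic one:

$$J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\Big[y_k^{(i)}\log\big(h_\Theta(x^{(i)})\big)_k + \big(1-y_k^{(i)}\big)\log\Big(1-\big(h_\Theta(x^{(i)})\big)_k\Big)\Big] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\big(\Theta_{j,i}^{(l)}\big)^2$$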

Backpropagation Algorithm


Given training set $\{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$:
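Backpropagation accumulates per-example error terms $\delta$ into the gradient; the standard steps (with $\circ$ denoting element-wise multiplication) are:

  1. Set $\Delta_{ij}^{(l)} := 0$ for all $l, i, j$.
  2. For each training example $t = 1, \dots, m$: set $a^{(1)} := x^{(t)}$ and forward-propagate to get $a^{(l)}$ for $l = 2, \dots, L$; compute $\delta^{(L)} = a^{(L)} - y^{(t)}$; compute $\delta^{(L-1)}, \dots, \delta^{(2)}$ via $\delta^{(l)} = \big((\Theta^{(l)})^T \delta^{(l+1)}\big) \circ a^{(l)} \circ (1 - a^{(l)})$; accumulate $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T$.
  3. $D_{ij}^{(l)} := \frac{1}{m}\big(\Delta_{ij}^{(l)} + \lambda\Theta_{ij}^{(l)}\big)$ for $j \neq 0$, and $D_{ij}^{(l)} := \frac{1}{m}\Delta_{ij}^{(l)}$ for $j = 0$; then $\frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(l)}} = D_{ij}^{(l)}$.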

Backpropagation in Practice

```matlab
% Unroll the parameter and gradient matrices into single vectors
thetaVector = [ Theta1(:); Theta2(:); Theta3(:) ];
deltaVector = [ D1(:); D2(:); D3(:) ];

% Reshape back (these indices assume Theta1 and Theta2 are 10x11 and Theta3 is 1x11)
Theta1 = reshape(thetaVector(1:110), 10, 11);
Theta2 = reshape(thetaVector(111:220), 10, 11);
Theta3 = reshape(thetaVector(221:231), 1, 11);
```
```matlab
% Gradient checking: numerically approximate each partial derivative of J
epsilon = 1e-4;
for i = 1:n
  thetaPlus = theta;
  thetaPlus(i) = thetaPlus(i) + epsilon;
  thetaMinus = theta;
  thetaMinus(i) = thetaMinus(i) - epsilon;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * epsilon);
end
```

5. Evaluating a Learning Algorithm

Evaluating a Hypothesis

Once we have done some troubleshooting for errors in our predictions (e.g. by getting more training examples, trying smaller or larger sets of features, or adjusting $\lambda$), we can evaluate our new hypothesis.

The test set error

  1. For linear regression: $J_{test}(\theta) = \frac{1}{2m_{test}}\sum_{i=1}^{m_{test}}\big(h_\theta(x_{test}^{(i)}) - y_{test}^{(i)}\big)^2$
  2. For classification ~ Misclassification error (aka 0/1 misclassification error): $err(h_\theta(x), y) = 1$ if $h_\theta(x) \geq 0.5,\ y = 0$ or $h_\theta(x) < 0.5,\ y = 1$; otherwise $0$

    The average test error for the test set is: $\text{Test Error} = \frac{1}{m_{test}}\sum_{i=1}^{m_{test}} err\big(h_\theta(x_{test}^{(i)}), y_{test}^{(i)}\big)$

Model Selection and Train/Validation/Test Sets

One way to break down our dataset into the three sets is: 60% training set, 20% cross validation set, 20% test set.

We can now calculate three separate error values for the three different sets using the following method:

  1. Optimize the parameters $\Theta$ using the training set for each polynomial degree.
  2. Find the polynomial degree $d$ with the least error using the cross validation set.
  3. Estimate the generalization error using the test set with $J_{test}(\Theta^{(d)})$.

This way, the degree of the polynomial d has not been trained using the test set.

Bias vs. Variance

Diagnosing Bias vs. Variance

Regularization and Bias/Variance

  1. Create a list of $\lambda$s (i.e. $\lambda \in \{0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24\}$);
    Create a set of models with different degrees or any other variants.
  2. Iterate through the $\lambda$s and for each $\lambda$ go through all the models to learn some $\Theta$.
  3. Compute the cross validation error using the learned $\Theta$ (computed with $\lambda$) on $J_{CV}(\Theta)$ without regularization, i.e. $\lambda = 0$.
  4. Select the best combo $\Theta$ and $\lambda$ that produces the lowest error on the cross validation set.
  5. Using the best combo $\Theta$ and $\lambda$, apply it on $J_{test}(\Theta)$ to see if it has a good generalization of the problem.

Learning Curves


Deciding What to Do Next Revisited

System Design

Prioritizing What to Work On
Error Analysis
Handling Skewed Classes
|  | Actual Class: 1 | Actual Class: 0 |
| --- | --- | --- |
| Predicted Class: 1 | True Positive | False Positive |
| Predicted Class: 0 | False Negative | True Negative |
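Accuracy alone is misleading on skewed classes, so precision, recall, and their harmonic mean (the $F_1$ score) are used:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$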


Using Large Data Sets

Rationale:
Useful test: Given the input $x$, can a human expert confidently predict $y$?

6. Support Vector Machines

Large Margin Classification

SVM Hypothesis
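The SVM swaps the logistic log-costs for hinge-like costs $\text{cost}_1$, $\text{cost}_0$ and uses the parameter $C$ (roughly playing the role of $\frac{1}{\lambda}$):

$$\min_\theta\; C\sum_{i=1}^{m}\Big[y^{(i)}\,\text{cost}_1(\theta^Tx^{(i)}) + \big(1-y^{(i)}\big)\,\text{cost}_0(\theta^Tx^{(i)})\Big] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$$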

Large Margin Intuition

SVM Decision Boundary

If $\|\theta\|$ can be small, then the projections $p^{(i)}$ of the examples onto $\theta$ have to be large (since the constraints require $p^{(i)} \cdot \|\theta\| \geq 1$ or $\leq -1$), and that is a large margin.

Kernels

Kernels and Similarity

We define the similarity between $x$ and landmark $l^{(i)}$ as the Gaussian kernel: $\text{similarity}(x, l^{(i)}) = \exp\left(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}\right)$
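As a one-function Octave sketch (the name `gaussianKernel` is illustrative):

```matlab
function sim = gaussianKernel(x1, x2, sigma)
  % Gaussian (RBF) kernel: similarity between two feature vectors
  sim = exp(-sum((x1(:) - x2(:)) .^ 2) / (2 * sigma ^ 2));
end
```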

SVM Parameters

SVMs in Practice

7. Clustering

K-Means Algorithm
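A compact Octave sketch of the two alternating steps, cluster assignment and centroid move; variable names are illustrative:

```matlab
% X: m x n data, centroids: K x n (e.g. initialized to K randomly chosen examples)
for iter = 1:max_iters
  % Cluster assignment step: index of the closest centroid for every example
  for i = 1:size(X, 1)
    dists = sum(bsxfun(@minus, centroids, X(i, :)) .^ 2, 2);
    [~, idx(i)] = min(dists);
  end
  % Move centroid step: each centroid becomes the mean of the points assigned to it
  for k = 1:size(centroids, 1)
    centroids(k, :) = mean(X(idx == k, :), 1);
  end
end
```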

K-Means Optimization Objective

Random Initialization

Run random initialization multiple times and pick the clustering with the lowest cost.

Principal Component Analysis

PCA is unsupervised learning, while linear regression is supervised learning.

PCA Algorithm
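In Octave the standard recipe (after mean normalization and feature scaling) reduces $X$ from $n$ to a chosen $k$ dimensions via the SVD of the covariance matrix:

```matlab
Sigma = (X' * X) / size(X, 1);   % n x n covariance matrix (X already mean-normalized)
[U, S, V] = svd(Sigma);          % columns of U are the principal components
Ureduce = U(:, 1:k);             % keep the first k components
Z = X * Ureduce;                 % project: m x k compressed representation
Xapprox = Z * Ureduce';          % approximate reconstruction, used when choosing k
```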

Dimensionality Reduction

Application of PCA

Bad use of PCA: To prevent overfitting
This might work OK, but isn’t a good way to address overfitting.
Use regularization instead. PCA throws away information without making any use of the labels $y$.

8. Anomaly Detection

Anomaly Detection Algorithm
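Modeling each feature as an independent Gaussian with $\mu_j$, $\sigma_j^2$ estimated from the (mostly normal) training set:

$$p(x) = \prod_{j=1}^{n} p\big(x_j;\, \mu_j, \sigma_j^2\big) = \prod_{j=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma_j}\exp\!\left(-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}\right), \qquad \text{flag an anomaly if } p(x) < \varepsilon$$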

Evaluation

We can use the same metrics and method as before: precision, recall, and the $F_1$ score on labeled cross validation and test sets.

Anomaly Detection vs. Supervised Learning

Anomaly Detection with the Multivariate Gaussian

Relation to Original Model

Original Model vs. Multivariate Gaussian

Recommender Systems

Content-based Recommender Systems

Collaborative Filtering

Collaborative Filtering Algorithm

Mean normalization, so that otherwise undefined ratings (e.g. for a user who has rated nothing) are predicted meaningfully, defaulting to each movie's mean rating.

9. Large Scale Machine Learning

Gradient Descent with Large Datasets

Batch Gradient Descent

Stochastic Gradient Descent
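Instead of summing over all $m$ examples before each step, stochastic gradient descent randomly shuffles the training set and then updates the parameters after every single example:

$$\theta_j := \theta_j - \alpha\big(h_\theta(x^{(i)}) - y^{(i)}\big)\,x_j^{(i)} \qquad \text{for every } j,\ \text{sweeping } i = 1, \dots, m$$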

Mini-Batch Gradient Descent

Comparison

Checking for Convergence

Online Learning

Map Reduce and Data Parallelism

10. Application Example: Photo OCR

Photo OCR Pipeline

Sliding Windows

Getting Lots of Data and Artificial Data

Synthesizing data by introducing distortions

Discussion on Getting More Data

Ceiling Analysis: What Part of the Pipeline to Work on Next

11. Summary: Main Topics
