@KarlYan95 2017-10-08

scikit-learn tutorial

Python


1. Introduction

1.1 Model persistence

1.1.1 Method 1: pickle

It is possible to save a model in scikit-learn by using Python's built-in persistence module, pickle:

>>> from sklearn import svm
>>> from sklearn import datasets
>>> clf = svm.SVC()
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> clf.fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
>>> import pickle
>>> s = pickle.dumps(clf)
>>> clf2 = pickle.loads(s)
>>> clf2.predict(X[0:1])
array([0])
>>> y[0]
0

1.1.2 Method 2: joblib

In the specific case of scikit-learn, it may be more interesting to use joblib's replacement for pickle (joblib.dump and joblib.load), which is more efficient on big data but can only pickle to disk and not to a string:

>>> from sklearn.externals import joblib
>>> joblib.dump(clf, 'filename.pkl')

Later you can load back the pickled model (possibly in another Python process) with:

>>> clf = joblib.load('filename.pkl')
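Note: sklearn.externals.joblib has since been deprecated (it was removed in scikit-learn 0.23); with the standalone joblib package installed, the equivalent of the two snippets above would look like this:

import joblib

joblib.dump(clf, 'filename.pkl')   # persist the fitted estimator to disk
clf = joblib.load('filename.pkl')  # load it back, possibly in another Python process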

2. A tutorial on statistical-learning for scientific data processing

Statistical learning
Machine learning is a technique of growing importance, as the size of the datasets that experimental sciences face is growing rapidly. The problems it tackles range from building a prediction function linking different observations, to classifying observations, to learning the structure of an unlabeled dataset.
This tutorial explores statistical learning, the use of machine learning techniques with the goal of statistical inference: drawing conclusions on the data at hand.
Scikit-learn is a Python module integrating classic machine learning algorithms in the tightly-knit world of scientific Python packages (NumPy, SciPy, matplotlib).

2.1 Statistical learning: the setting and the estimator object in scikit-learn

2.1.1 Datasets

Scikit-learn deals with learning information from one or more datasets that are represented as 2D arrays. They can be understood as lists of multi-dimensional observations. We say that the first axis of these arrays is the samples axis, while the second is the features axis.

>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> data = iris.data
>>> data.shape
(150, 4)

from sklearn import datasets

digits = datasets.load_digits()
print(digits.images.shape)  # (1797, 8, 8)
# Flatten each 8x8 image into a vector of 64 features so it can be used by an estimator
data = digits.images.reshape((digits.images.shape[0], -1))
print(data.shape)  # (1797, 64)

2.1.2 Estimator objects
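The notes leave this section empty; as a minimal sketch of what the tutorial covers here (the choice of KNeighborsClassifier below is just an illustrative assumption), every scikit-learn estimator takes its hyperparameters as constructor arguments, learns from data in fit(X, y), and exposes the learned parameters as attributes ending in an underscore:

from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
estimator = KNeighborsClassifier(n_neighbors=5)  # hyperparameters are set in the constructor
estimator.fit(iris.data, iris.target)            # estimation happens in fit(X, y)
print(estimator.get_params())                    # read back the hyperparameters
print(estimator.classes_)                        # learned attributes end with an underscore: [0 1 2]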

2.2 Supervised learning: predicting an output variable from high-dimensional observations

2.2.1 k-nearest neighbors and the curse of dimensionality

2.2.1.1 k-nearest neighbors classifier
from sklearn import datasets
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Load the data
iris = datasets.load_iris()
data = iris.data
label = iris.target

# Generate a random permutation of the indices
np.random.seed(0)
indices = np.random.permutation(len(data))

# Split the data into a training set and a test set (last 10 samples held out)
data_train = data[indices[:-10]]
data_train_label = label[indices[:-10]]
data_test = data[indices[-10:]]
data_test_label = label[indices[-10:]]

# Fit a k-nearest neighbors classifier and predict on the held-out samples
knn = KNeighborsClassifier()
knn.fit(data_train, data_train_label)
print(knn.predict(data_test))  # [1 2 1 0 0 0 2 1 2 0]
print(data_test_label)         # [1 1 1 0 0 0 2 1 2 0]
2.2.1.2 The curse of dimensionality

For an estimator to be effective, you need the distance between neighboring points to be less than some value d, which depends on the problem. In one dimension, this requires on average n ~ 1/d points. In the context of the above k-NN example, if the data is described by just one feature with values ranging from 0 to 1 and with n training observations, then new data will be no further away than 1/n. Therefore, the nearest neighbor decision rule will be efficient as soon as 1/n is small compared to the scale of between-class feature variations.
If the number of features is p, you now require n ~ 1/d^p points. Let's say that we require 10 points in one dimension: now 10^p points are required in p dimensions to pave the [0, 1] space. As p becomes large, the number of training points required for a good estimator grows exponentially.
For example, if each point is just a single number (8 bytes), then an effective k-NN estimator in a paltry p ~ 20 dimensions would require more training data than the current estimated size of the entire internet (±1000 exabytes or so).
This is called the curse of dimensionality and is a core problem that machine learning addresses.
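A quick back-of-the-envelope check of the numbers above (using the paragraph's own assumptions of 10 points per dimension and 8 bytes per point):

# How the number of points needed to pave the [0, 1]^p space grows with the dimension p
for p in (1, 2, 5, 10, 20):
    n_points = 10 ** p
    print(p, n_points, n_points * 8 / 1e18, 'exabytes')  # 10**20 points -> 800 exabytes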

2.2.2 Linear model: from regression to sparsity

2.2.2.1 Linear regression
# -*- coding: utf-8 -*-
from sklearn import datasets
from sklearn import linear_model
import numpy as np

# Load the diabetes dataset and split off the last 20 samples as a test set
diabetes = datasets.load_diabetes()
data_train = diabetes.data[:-20]
data_train_label = diabetes.target[:-20]
data_test = diabetes.data[-20:]
data_test_label = diabetes.target[-20:]

# Fit an ordinary least squares regression
regression = linear_model.LinearRegression()
regression.fit(data_train, data_train_label)
print(regression.coef_)  # fitted coefficients
print(np.mean((regression.predict(data_test) - data_test_label) ** 2))  # mean squared error on the test set
print(regression.score(data_test, data_test_label))  # R^2 score: 1 is a perfect prediction

2.2.3 Support vector machines (SVMs)

from sklearn import svm

# Fit a linear support vector classifier
# (iris_X_train / iris_y_train are an iris training split, e.g. the one built in the k-NN example)
svc = svm.SVC(kernel='linear')
svc.fit(iris_X_train, iris_y_train)

# Available kernels
svc = svm.SVC(kernel='linear')
svc = svm.SVC(kernel='poly', degree=3)
svc = svm.SVC(kernel='rbf')
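The first snippet above assumes an iris_X_train / iris_y_train split that is not built in these notes; a self-contained sketch that reuses the random split from the k-NN example would be:

from sklearn import datasets, svm
import numpy as np

# Same random split as in the k-NN example: last 10 samples held out for testing
iris = datasets.load_iris()
np.random.seed(0)
indices = np.random.permutation(len(iris.data))
X_train, y_train = iris.data[indices[:-10]], iris.target[indices[:-10]]
X_test, y_test = iris.data[indices[-10:]], iris.target[indices[-10:]]

svc = svm.SVC(kernel='linear')
svc.fit(X_train, y_train)
print(svc.score(X_test, y_test))  # fraction of test samples classified correctly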

2.3 Model selection: choosing estimators and their parameters

2.3.1 Score, and cross-validated scores

>>> from sklearn import datasets, svm
>>> digits = datasets.load_digits()
>>> X_digits = digits.data
>>> y_digits = digits.target
>>> svc = svm.SVC(C=1, kernel='linear')
>>> svc.fit(X_digits[:-100], y_digits[:-100]).score(X_digits[-100:], y_digits[-100:])
0.97999999999999998

>>> import numpy as np
>>> X_folds = np.array_split(X_digits, 3)
>>> y_folds = np.array_split(y_digits, 3)
>>> scores = list()
>>> for k in range(3):
...     # We use 'list' to copy, in order to 'pop' later on
...     X_train = list(X_folds)
...     X_test = X_train.pop(k)
...     X_train = np.concatenate(X_train)
...     y_train = list(y_folds)
...     y_test = y_train.pop(k)
...     y_train = np.concatenate(y_train)
...     scores.append(svc.fit(X_train, y_train).score(X_test, y_test))
>>> print(scores)
[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]

2.3.2 Cross-validation generators

>>> from sklearn.model_selection import KFold, cross_val_score
>>> X = ["a", "a", "b", "c", "c", "c"]
>>> k_fold = KFold(n_splits=3)
>>> for train_indices, test_indices in k_fold.split(X):
...     print('Train: %s | test: %s' % (train_indices, test_indices))
Train: [2 3 4 5] | test: [0 1]
Train: [0 1 4 5] | test: [2 3]
Train: [0 1 2 3] | test: [4 5]

>>> [svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test])
...  for train, test in k_fold.split(X_digits)]
[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]

>>> cross_val_score(svc, X_digits, y_digits, cv=k_fold, n_jobs=-1)
array([ 0.93489149,  0.95659432,  0.93989983])
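cross_val_score uses the estimator's own score method by default (mean accuracy for classifiers); a different metric can be requested through the scoring parameter, for example:

>>> cross_val_score(svc, X_digits, y_digits, cv=k_fold, scoring='precision_macro')  # three per-fold precision scores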

2.3.3 Grid-search and cross-validated estimators

>>> from sklearn.model_selection import GridSearchCV, cross_val_score
>>> Cs = np.logspace(-6, -1, 10)
>>> clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs),
...                    n_jobs=-1)
>>> clf.fit(X_digits[:1000], y_digits[:1000])
GridSearchCV(cv=None,...
>>> clf.best_score_
0.925...
>>> clf.best_estimator_.C
0.0077...
>>> # Prediction performance on test set is not as good as on train set
>>> clf.score(X_digits[1000:], y_digits[1000:])
0.943...
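By nesting cross_val_score around the GridSearchCV object, the parameter search is redone inside each fold and the resulting scores give an unbiased estimate of the tuned model's performance; a sketch along the lines of the scikit-learn tutorial:

>>> cross_val_score(clf, X_digits, y_digits)  # clf is the GridSearchCV estimator defined above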