@KarlYan95 2017-10-08

scikit-learn tutorial

Python


1. Introduction

1.1 Model persistence

1.1.1 Method 1: pickle

It is possible to save a model in scikit-learn by using Python's built-in persistence module, pickle:

>>> from sklearn import svm
>>> from sklearn import datasets
>>> clf = svm.SVC()
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> clf.fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
>>> import pickle
>>> s = pickle.dumps(clf)
>>> clf2 = pickle.loads(s)
>>> clf2.predict(X[0:1])
array([0])
>>> y[0]
0

1.1.2 Method 2: joblib

In the specific case of scikit-learn, it may be more interesting to use joblib's replacement for pickle (joblib.dump and joblib.load), which is more efficient on big data but can only pickle to disk and not to a string:

>>> from sklearn.externals import joblib
>>> joblib.dump(clf, 'filename.pkl')

Later you can load back the pickled model (possibly in another Python process) with:

>>> clf = joblib.load('filename.pkl')
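Note: sklearn.externals.joblib has since been deprecated (it was removed in scikit-learn 0.23); with the standalone joblib package installed, the equivalent of the two snippets above would look like this:

import joblib

joblib.dump(clf, 'filename.pkl')   # persist the fitted estimator to disk
clf = joblib.load('filename.pkl')  # load it back, possibly in another Python process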

2. A tutorial on statistical-learning for scientific data processing

Statistical learning
Machine learning is a technique of growing importance, as the size of the datasets that experimental sciences face is growing rapidly. The problems it tackles range from building a prediction function linking different observations, to classifying observations, to learning the structure of an unlabeled dataset.
This tutorial explores statistical learning, the use of machine learning techniques with the goal of statistical inference: drawing conclusions on the data at hand.
Scikit-learn is a Python module integrating classic machine learning algorithms in the tightly-knit world of scientific Python packages (NumPy, SciPy, matplotlib).

2.1 Statistical learning: the setting and the estimator object in scikit-learn

2.1.1 Datasets

Scikit-learn deals with learning information from one or more datasets that are represented as 2D arrays. They can be understood as lists of multi-dimensional observations. We say that the first axis of these arrays is the samples axis, while the second is the features axis.

>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> data = iris.data
>>> data.shape
(150, 4)

from sklearn import datasets

digits = datasets.load_digits()
print(digits.images.shape)  # (1797, 8, 8)
# Flatten each 8x8 image into a vector of 64 features so it can be used by an estimator
data = digits.images.reshape((digits.images.shape[0], -1))
print(data.shape)  # (1797, 64)

2.1.2 Estimator objects
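The notes leave this section empty; as a minimal sketch of what the tutorial covers here (the choice of KNeighborsClassifier below is just an illustrative assumption), every scikit-learn estimator takes its hyperparameters as constructor arguments, learns from data in fit(X, y), and exposes the learned parameters as attributes ending in an underscore:

from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
estimator = KNeighborsClassifier(n_neighbors=5)  # hyperparameters are set in the constructor
estimator.fit(iris.data, iris.target)            # estimation happens in fit(X, y)
print(estimator.get_params())                    # read back the hyperparameters
print(estimator.classes_)                        # learned attributes end with an underscore: [0 1 2]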

2.2 Supervised learning: predicting an output variable from high-dimensional observations

2.2.1 k-nearest neighbors and the curse of dimensionality

2.2.1.1 k-nearest neighbors classifier
from sklearn import datasets
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Load the data
iris = datasets.load_iris()
data = iris.data
label = iris.target

# Generate a random permutation of the indices
np.random.seed(0)
indices = np.random.permutation(len(data))

# Split the data into a training set and a test set (last 10 samples held out)
data_train = data[indices[:-10]]
data_train_label = label[indices[:-10]]
data_test = data[indices[-10:]]
data_test_label = label[indices[-10:]]

# Fit a k-nearest neighbors classifier and predict on the held-out samples
knn = KNeighborsClassifier()
knn.fit(data_train, data_train_label)
print(knn.predict(data_test))  # [1 2 1 0 0 0 2 1 2 0]
print(data_test_label)         # [1 1 1 0 0 0 2 1 2 0]
2.2.1.2 The curse of dimensionality

For an estimator to be effective, you need the distance between neighboring points to be less than some value d, which depends on the problem. In one dimension, this requires on average n ~ 1/d points. In the context of the above k-NN example, if the data is described by just one feature with values ranging from 0 to 1 and with n training observations, then new data will be no further away than 1/n. Therefore, the nearest neighbor decision rule will be efficient as soon as 1/n is small compared to the scale of between-class feature variations.
If the number of features is p, you now require n ~ 1/d^p points. Let's say that we require 10 points in one dimension: now 10^p points are required in p dimensions to pave the [0, 1] space. As p becomes large, the number of training points required for a good estimator grows exponentially.
For example, if each point is just a single number (8 bytes), then an effective k-NN estimator in a paltry p ~ 20 dimensions would require more training data than the current estimated size of the entire internet (±1000 exabytes or so).
This is called the curse of dimensionality and is a core problem that machine learning addresses.
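A quick back-of-the-envelope check of the numbers above (using the paragraph's own assumptions of 10 points per dimension and 8 bytes per point):

# How the number of points needed to pave the [0, 1]^p space grows with the dimension p
for p in (1, 2, 5, 10, 20):
    n_points = 10 ** p
    print(p, n_points, n_points * 8 / 1e18, 'exabytes')  # 10**20 points -> 800 exabytes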

2.2.2 Linear model: from regression to sparsity

2.2.2.1 Linear regression
# -*- coding: utf-8 -*-
from sklearn import datasets
from sklearn import linear_model
import numpy as np

# Load the diabetes dataset and split off the last 20 samples as a test set
diabetes = datasets.load_diabetes()
data_train = diabetes.data[:-20]
data_train_label = diabetes.target[:-20]
data_test = diabetes.data[-20:]
data_test_label = diabetes.target[-20:]

# Fit an ordinary least squares regression
regression = linear_model.LinearRegression()
regression.fit(data_train, data_train_label)
print(regression.coef_)  # fitted coefficients
print(np.mean((regression.predict(data_test) - data_test_label) ** 2))  # mean squared error on the test set
print(regression.score(data_test, data_test_label))  # R^2 score: 1 is a perfect prediction

2.2.3 Support vector machines (SVMs)

from sklearn import svm

# Fit a linear support vector classifier
# (iris_X_train / iris_y_train are an iris training split, e.g. the one built in the k-NN example)
svc = svm.SVC(kernel='linear')
svc.fit(iris_X_train, iris_y_train)

# Available kernels
svc = svm.SVC(kernel='linear')
svc = svm.SVC(kernel='poly', degree=3)
svc = svm.SVC(kernel='rbf')
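The first snippet above assumes an iris_X_train / iris_y_train split that is not built in these notes; a self-contained sketch that reuses the random split from the k-NN example would be:

from sklearn import datasets, svm
import numpy as np

# Same random split as in the k-NN example: last 10 samples held out for testing
iris = datasets.load_iris()
np.random.seed(0)
indices = np.random.permutation(len(iris.data))
X_train, y_train = iris.data[indices[:-10]], iris.target[indices[:-10]]
X_test, y_test = iris.data[indices[-10:]], iris.target[indices[-10:]]

svc = svm.SVC(kernel='linear')
svc.fit(X_train, y_train)
print(svc.score(X_test, y_test))  # fraction of test samples classified correctly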

2.3 Model selection: choosing estimators and their parameters

2.3.1 Score, and cross-validated scores

>>> from sklearn import datasets, svm
>>> digits = datasets.load_digits()
>>> X_digits = digits.data
>>> y_digits = digits.target
>>> svc = svm.SVC(C=1, kernel='linear')
>>> svc.fit(X_digits[:-100], y_digits[:-100]).score(X_digits[-100:], y_digits[-100:])
0.97999999999999998

>>> import numpy as np
>>> X_folds = np.array_split(X_digits, 3)
>>> y_folds = np.array_split(y_digits, 3)
>>> scores = list()
>>> for k in range(3):
...     # We use 'list' to copy, in order to 'pop' later on
...     X_train = list(X_folds)
...     X_test = X_train.pop(k)
...     X_train = np.concatenate(X_train)
...     y_train = list(y_folds)
...     y_test = y_train.pop(k)
...     y_train = np.concatenate(y_train)
...     scores.append(svc.fit(X_train, y_train).score(X_test, y_test))
>>> print(scores)
[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]

2.3.2 Cross-validation generators

>>> from sklearn.model_selection import KFold, cross_val_score
>>> X = ["a", "a", "b", "c", "c", "c"]
>>> k_fold = KFold(n_splits=3)
>>> for train_indices, test_indices in k_fold.split(X):
...     print('Train: %s | test: %s' % (train_indices, test_indices))
Train: [2 3 4 5] | test: [0 1]
Train: [0 1 4 5] | test: [2 3]
Train: [0 1 2 3] | test: [4 5]

>>> [svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test])
...  for train, test in k_fold.split(X_digits)]
[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]

>>> cross_val_score(svc, X_digits, y_digits, cv=k_fold, n_jobs=-1)
array([ 0.93489149,  0.95659432,  0.93989983])
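cross_val_score uses the estimator's own score method by default (mean accuracy for classifiers); a different metric can be requested through the scoring parameter, for example:

>>> cross_val_score(svc, X_digits, y_digits, cv=k_fold, scoring='precision_macro')  # three per-fold precision scores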

2.3.3 Grid-search and cross-validated estimators

>>> from sklearn.model_selection import GridSearchCV, cross_val_score
>>> Cs = np.logspace(-6, -1, 10)
>>> clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs),
...                    n_jobs=-1)
>>> clf.fit(X_digits[:1000], y_digits[:1000])
GridSearchCV(cv=None,...
>>> clf.best_score_
0.925...
>>> clf.best_estimator_.C
0.0077...
>>> # Prediction performance on test set is not as good as on train set
>>> clf.score(X_digits[1000:], y_digits[1000:])
0.943...
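By nesting cross_val_score around the GridSearchCV object, the parameter search is redone inside each fold and the resulting scores give an unbiased estimate of the tuned model's performance; a sketch along the lines of the scikit-learn tutorial:

>>> cross_val_score(clf, X_digits, y_digits)  # clf is the GridSearchCV estimator defined above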