@KarlYan95
2017-10-08T14:30:03.000000Z
Python
It is possible to save a model in scikit-learn by using Python’s built-in persistence model, namely pickle:
>>> from sklearn import svm
>>> from sklearn import datasets
>>> clf = svm.SVC()
>>> iris = datasets.load_iris()
>>> X, y = iris.data, iris.target
>>> clf.fit(X, y)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
>>> import pickle
>>> s = pickle.dumps(clf)
>>> clf2 = pickle.loads(s)
>>> clf2.predict(X[0:1])
array([0])
>>> y[0]
0
In the specific case of the scikit, it may be more interesting to use joblib’s replacement of pickle (joblib.dump & joblib.load), which is more efficient on big data, but can only pickle to the disk and not to a string:
>>> from sklearn.externals import joblib
>>> joblib.dump(clf, 'filename.pkl')
Later you can load back the pickled model (possibly in another Python process) with:
>>> clf = joblib.load('filename.pkl')
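Note that sklearn.externals.joblib was deprecated in later scikit-learn releases; a minimal equivalent sketch that uses the standalone joblib package directly (assuming it is installed) is:

import joblib

# persist the fitted estimator to disk
joblib.dump(clf, 'filename.pkl')

# load it back later, possibly in another Python process
clf = joblib.load('filename.pkl')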
Statistical learning
Machine learning is a technique of growing importance, as the size of the datasets that experimental sciences face is rapidly increasing. The problems it tackles range from building a prediction function linking different observations, to classifying observations, to learning the structure of an unlabeled dataset.
This tutorial will explore statistical learning, the use of machine learning techniques with the goal of statistical inference: drawing conclusions on the data at hand.
Scikit-learn is a Python module integrating classic machine learning algorithms in the tightly-knit world of scientific Python packages (NumPy, SciPy, matplotlib).
Scikit-learn deals with learning information from one or more datasets that are represented as 2D arrays. They can be understood as a list of multi-dimensional observations. We say that the first axis of these arrays is the samples axis, while the second is the features axis.
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> data = iris.data
>>> data.shape
(150, 4)
from sklearn import datasets

digits = datasets.load_digits()
print(digits.images.shape)  # (1797, 8, 8)

# flatten each 8x8 image into a 64-dimensional feature vector
data = digits.images.reshape((digits.images.shape[0], -1))
print(data.shape)  # (1797, 64)
The fit(X, y) method trains the model, and predict(X) then returns the predicted labels y:
from sklearn import datasets
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# load the data
iris = datasets.load_iris()
data = iris.data
label = iris.target

# generate a random permutation of the indices
np.random.seed(0)
indices = np.random.permutation(len(data))

# split into training and test sets
data_train = data[indices[:-10]]
data_train_label = label[indices[:-10]]
data_test = data[indices[-10:]]
data_test_label = label[indices[-10:]]

# classify with k-nearest neighbors
knn = KNeighborsClassifier()
knn.fit(data_train, data_train_label)
print(knn.predict(data_test))  # [1 2 1 0 0 0 2 1 2 0]
print(data_test_label)         # [1 1 1 0 0 0 2 1 2 0]
For an estimator to be effective, you need the distance between neighboring points to be less than some value d, which depends on the problem. In one dimension, this requires on average n ~ 1/d points. In the context of the above k-NN example, if the data is described by just one feature with values ranging from 0 to 1 and with n training observations, then new data will be no further away than 1/n. Therefore, the nearest neighbor decision rule will be efficient as soon as 1/n is small compared to the scale of between-class feature variations.
If the number of features is p, you now require n ~ 1/d^p points. Say we require 10 points in one dimension: then 10^p points are required in p dimensions to pave the [0, 1] space. As p becomes large, the number of training points required for a good estimator grows exponentially.
For example, if each point is just a single number (8 bytes), then an effective k-NN estimator in a paltry p ~ 20 dimensions would require more training data than the current estimated size of the entire internet (±1000 exabytes or so).
This is called the curse of dimensionality and is a core problem that machine learning addresses.
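A back-of-the-envelope sketch of this exponential growth, using the same assumptions as above (10 points per dimension, 8 bytes per stored value):

# points needed to pave [0, 1]^p at 10 points per dimension,
# and the raw memory they would occupy at 8 bytes per value
for p in (1, 2, 5, 10, 20):
    n_points = 10 ** p
    memory_bytes = 8 * n_points
    print(p, n_points, memory_bytes)
# at p = 20 this is 10**20 points, i.e. roughly 800 exabytes of raw numbers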
# -*- coding: utf-8 -*-
from sklearn import datasets
from sklearn import linear_model
import numpy as np
import matplotlib.pyplot as plt

# load the data
diabetes = datasets.load_diabetes()
data_train = diabetes.data[:-20]
data_train_label = diabetes.target[:-20]
data_test = diabetes.data[-20:]
data_test_label = diabetes.target[-20:]

# fit a linear regression
regression = linear_model.LinearRegression()
regression.fit(data_train, data_train_label)
print(regression.coef_)  # regression coefficients
print(np.mean((regression.predict(data_test) - data_test_label) ** 2))  # mean squared error
print(regression.score(data_test, data_test_label))  # R^2 score; 1 is perfect prediction
from sklearn import datasets, svm
import numpy as np

# training split of iris, shuffled as in the k-NN example above
iris = datasets.load_iris()
np.random.seed(0)
indices = np.random.permutation(len(iris.data))
iris_X_train, iris_y_train = iris.data[indices[:-10]], iris.target[indices[:-10]]

svc = svm.SVC(kernel='linear')
svc.fit(iris_X_train, iris_y_train)

# other available kernels
svc = svm.SVC(kernel='poly', degree=3)
svc = svm.SVC(kernel='rbf')
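As a quick usage sketch (not from the original tutorial), the three kernels can be fitted and scored on the same held-out samples; the split below simply mirrors the shuffled split used in the k-NN example:

from sklearn import datasets, svm
import numpy as np

# same shuffled split as above: last 10 samples held out for testing
iris = datasets.load_iris()
np.random.seed(0)
indices = np.random.permutation(len(iris.data))
X_train, y_train = iris.data[indices[:-10]], iris.target[indices[:-10]]
X_test, y_test = iris.data[indices[-10:]], iris.target[indices[-10:]]

# fit and score each kernel on the same held-out samples
for kernel in ('linear', 'poly', 'rbf'):
    svc = svm.SVC(kernel=kernel)
    svc.fit(X_train, y_train)
    print(kernel, svc.score(X_test, y_test))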
>>> from sklearn import datasets, svm
>>> digits = datasets.load_digits()
>>> X_digits = digits.data
>>> y_digits = digits.target
>>> svc = svm.SVC(C=1, kernel='linear')
>>> svc.fit(X_digits[:-100], y_digits[:-100]).score(X_digits[-100:], y_digits[-100:])
0.97999999999999998
>>> import numpy as np
>>> X_folds = np.array_split(X_digits, 3)
>>> y_folds = np.array_split(y_digits, 3)
>>> scores = list()
>>> for k in range(3):
...     # We use 'list' to copy, in order to 'pop' later on
...     X_train = list(X_folds)
...     X_test = X_train.pop(k)
...     X_train = np.concatenate(X_train)
...     y_train = list(y_folds)
...     y_test = y_train.pop(k)
...     y_train = np.concatenate(y_train)
...     scores.append(svc.fit(X_train, y_train).score(X_test, y_test))
>>> print(scores)
[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]
>>> from sklearn.model_selection import KFold, cross_val_score
>>> X = ["a", "a", "b", "c", "c", "c"]
>>> k_fold = KFold(n_splits=3)
>>> for train_indices, test_indices in k_fold.split(X):
...     print('Train: %s | test: %s' % (train_indices, test_indices))
Train: [2 3 4 5] | test: [0 1]
Train: [0 1 4 5] | test: [2 3]
Train: [0 1 2 3] | test: [4 5]
>>> [svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test])
... for train, test in k_fold.split(X_digits)]
[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]
cross_val_score computes these cross-validation scores directly:
>>> cross_val_score(svc, X_digits, y_digits, cv=k_fold, n_jobs=-1)
array([ 0.93489149, 0.95659432, 0.93989983])
scikit-learn provides an object that, given data, computes the score during the fit of an estimator on a parameter grid and chooses the parameters to maximize the cross-validation score. This object takes an estimator during the construction and exposes an estimator API:
>>> from sklearn.model_selection import GridSearchCV, cross_val_score
>>> Cs = np.logspace(-6, -1, 10)
>>> clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs),
... n_jobs=-1)
>>> clf.fit(X_digits[:1000], y_digits[:1000])
GridSearchCV(cv=None,...
>>> clf.best_score_
0.925...
>>> clf.best_estimator_.C
0.0077...
>>> # Prediction performance on test set is not as good as on train set
>>> clf.score(X_digits[1000:], y_digits[1000:])
0.943...