@chanvee 2015-05-20T11:28:35.000000Z

Feature selection

Python, Data Mining

The sklearn.feature_selection module can be used for feature selection/dimensionality reduction on sample sets, either to improve an estimator's accuracy or to boost its performance on high-dimensional datasets.

Removing features with low variance

VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn't meet some threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples.

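As an example, suppose we have a dataset with Boolean features, and we want to remove every feature that is either one or zero in more than 80% of the samples. Boolean features are Bernoulli random variables, and the variance of such a variable is: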

$\mathrm{Var}[X] = p(1 - p)$

so we can select features using the threshold 0.8 * (1 - 0.8):

>>> from sklearn.feature_selection import VarianceThreshold
>>> X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
>>> sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
>>> sel.fit_transform(X)
array([[0, 1],
       [1, 0],
       [0, 0],
       [1, 1],
       [1, 0],
       [1, 1]])

As expected, VarianceThreshold has removed the first column, which has a probability p = 5/6 > 0.8 of containing a zero.
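To see why only the first column is dropped, the sketch below recomputes the per-column variances by hand and compares them with the threshold. It reuses the same X as above; the variable names `column_variances` and `kept` are only illustrative.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]])
threshold = 0.8 * (1 - 0.8)  # roughly 0.16

# Population variance of each column (ddof=0).
# For a 0/1 column this equals p * (1 - p), where p is the fraction of ones.
column_variances = X.var(axis=0)

sel = VarianceThreshold(threshold=threshold).fit(X)
kept = sel.get_support()  # boolean mask: True for columns that survive

print("variances by hand:   ", column_variances)
print("variances_ attribute:", sel.variances_)
print("kept columns:        ", kept)
```

The first column has variance (1/6)(5/6) = 5/36 ≈ 0.139, which falls below the threshold, while the other two columns (2/9 and 1/4) stay above it.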

Univariate feature selection

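Univariate feature selection works by selecting the best features based on univariate statistical tests. It can be seen as a preprocessing step to an estimator. Scikit-learn exposes feature selection routines as objects that implement the transform method:
- SelectKBest keeps only the k highest-scoring features
- SelectPercentile keeps only a user-specified highest-scoring percentage of features
- Common univariate statistical tests applied to each feature: false positive rate SelectFpr, false discovery rate SelectFdr, or family-wise error SelectFwe.
- GenericUnivariateSelect performs univariate feature selection with a configurable strategy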

For instance, we can run a $\chi^2$ test on the samples to retrieve only the two best features, as follows:

>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest
>>> from sklearn.feature_selection import chi2
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> X_new.shape
(150, 2)
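The shape alone does not say which two features survived. A minimal sketch for inspecting this, assuming the fitted selector is kept in a variable (here called `selector`, a name not used in the original), reads the `scores_` attribute and the `get_support` mask:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
X, y = iris.data, iris.target

# Fit once and keep the selector so its scores and mask can be inspected.
selector = SelectKBest(chi2, k=2).fit(X, y)

# Print the chi2 score of every feature and whether it was kept.
for name, score, keep in zip(iris.feature_names, selector.scores_, selector.get_support()):
    print(f"{name}: chi2 = {score:.2f}, kept = {keep}")

# Column indices of the selected features.
selected_idx = selector.get_support(indices=True)
print(selected_idx)
```

On the iris data the two petal measurements (columns 2 and 3) score highest, so those are the columns returned by `fit_transform`.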

These objects take as input a scoring function that returns univariate p-values:
- For regression: f_regression
- For classification: chi2 or f_classif
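As a rough illustration of how a scoring function plugs in, the sketch below calls f_classif directly to show that it returns one score and one p-value per feature, then passes the same function to GenericUnivariateSelect. The choice of `mode="k_best"` with `param=2` is an assumption made here to mirror the SelectKBest example, not something prescribed by the text.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import GenericUnivariateSelect, f_classif

X, y = load_iris(return_X_y=True)

# A scoring function returns a pair of arrays: (scores, p-values), one entry per feature.
scores, pvalues = f_classif(X, y)
print("F scores:", scores)
print("p-values:", pvalues)

# The same function can drive any selector; here the strategy is chosen via `mode`.
selector = GenericUnivariateSelect(score_func=f_classif, mode="k_best", param=2)
X_new = selector.fit_transform(X, y)
print(X_new.shape)
```

f_regression follows the same (scores, p-values) convention, so it can be swapped in when the target is continuous.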