@qqiseeu 2015-10-29T06:43:48.000000Z 字数 9131 阅读 9241

Numpy with Cython：记一次对SGD算法的优化

概述

本文所用的例子来自一个编程作业。在该作业中要求实现一个集成学习算法，其中的主要部分就是训练多个感知机（Perceptron）。这里的感知机就是一个简单的线性分类器 $f(\mathbf{x})=\text{sign}(\mathbf{w}^\top\mathbf{x}+b)$ 。假设训练样本为

D = {(x 1, y 1), \dots, (x N, y N)}, x i \in ℝ D, y i \in {0, 1}

$D=\{(\mathbf{x_1}, y_1),\ldots, (\mathbf{x}_N, y_N)\}, \quad\mathbf{x}_i\in\mathbb{R}^D, y_i\in\{0, 1\}$
取

y′i=2yi−1 $y'_i=2y_i-1$ ，则训练过程中参数的基本更新规则为

w (t + 1) = w (t) + η (t) y' i x i if y' i (x ⊤ i w (t) + b (t)) < 0

$\mathbf{w}^{(t+1)}=\mathbf{w}^{(t)}+\eta^{(t)} y'_i\mathbf{x}_i \quad\text{if}\quad y'_i(\mathbf{x}_i^\top\mathbf{w}^{(t)}+b^{(t)})<0$
这其实就是基于hinge loss的SGD算法，其中

t $t$ 是迭代轮数，

η(t) $\eta^{(t)}$ 是第

t $t$ 轮的步长，

xi $\mathbf{x}_i$ 是随机sample的一个instance。在具体实现时可能会对上述规则做一些细微的修改，但基本规则不会有什么变化。

对于科学计算的任务（尤其是涉及到矩阵操作的任务），已经存在很好的Python数值计算库：Numpy与Scipy。其中Numpy实现了一个高效的（稠密）矩阵类型——numpy.ndarray，Scipy则提供了多种稀疏矩阵类型，且两个库之间还具有很好的互操作性。写过Matlab的同学肯定知道这样一条要诀：“任何能用矩阵操作的地方都不要用循环”。在用Python做科学计算时也有相同的原则，因为Numpy提供的矩阵操作其实都是对底层C程序的封装，可以完全离开Python解释器工作，因此速度相比于用for循环会快上很多倍（不过Python的解释器其实已经很快了，如果以后Pypy对Numpy的支持能够得到完善，那有些时候直接用for循环也是可以接受的）。但是总有这么些个算法，用矩阵操作来写就是不方便，而这里我们将要实现的SGD算法就是其中之一。

为什么Python这么慢？主要原因有两个：

Python是一门解释型的（interpreted）语言，Python源代码在执行时会先编译成一种特殊的字节码（bytecode），然后交予Python虚拟机负责执行，这比直接编译成二进制的可执行文件（C/C++就是这样）来说
就慢了不少。
Python是一门动态类型的（dynamically typed）语言，变量的具体类型只有在运行时才可知。事实上，Python解释器大部分时间都在干这么件事：弄清楚一个变量的当前类型到底是什么，然后把它底层的数据
抽取出来进行运算。例如对a+b这条语句，解释器首先要查询某个表确定a和b的当前类型，然后确定该类型是否支持+操作，并把实际执行该操作的函数提取出来，接下来把a和b作为参数传给上述函数，
而在该函数内部，则得先把a、b底层的数据取出来，完成实际的运算，再把结果包装成一个Python Object返回。整个过程会产生多次查询操作，且产生生成新的Python Object还需要在堆（heap）上分配
空间。如果是C代码，由于已知a与b的类型，这个加法操作不过是一两条CPU指令的功夫。

Python社区为这个问题提供了一个方便的解决方案：Cython。你可以简单地认为Cython是一种介于Python和C之间的编程语言，它为Python引入了C的静态类型系统（static type system），使得
我们可以将Python代码和C代码无缝衔接。除此之外Cython还可以封装C/C++的动态链接库并提供针对Python的接口。最关键的是，用Cython操作numpy.ndarray非常方便，通过调用numpy.ndarray.data可以直接访问其在底层封装的C数组。

Step 0：原始Python程序

下面是初版的程序，注意这是一个纯Python程序：

# sgd.py
from __future__ import division
import numpy as np
import numpy.random as random
def sgd(X, Y, ep):
    """ Train a simple perceptron using SGD
    Parameters
    ----------
    X: array of shape=(n_samples, n_features) with dtype=np.float64
        Each row contains the feature vector for a training sample instance.
    Y: array of shape=(n_samples, ) with dtype=np.int64 ranging in {0, 1}
        Corresponding label of training instances.
    ep: double
        Desired accuracy, which controls max iteration number.
    Returns
    -------
    w: array of shape=(n_features, ) with dtype=np.float64
        Weights vector of the linear classifier
    b: double
        Intercept term of the linear classifier
    """
    n_samples, n_features = X.shape
    w = random.randn(n_features)
    w_previous = w.copy() + 1
    b = 0.0
    t = 0
    T = 1 / ep ** 2
    idx = np.arange(n_samples)
    while t < T:
        # check stop criterion
        if np.linalg.norm(w_previous - w) <= 0.0001:
            break
        else:
            w_previous[:] = w
            random.shuffle(idx)
        # update weights vector and intercept
        for j in idx:
            lx = 2 * Y[j] - 1
            if (w.dot(X[j].T) + b) * lx < 0:
                step_size = 0.1 / (0.1 + t * ep ** 2)
                w += step_size * X[j] * lx
                b += step_size * lx
            t += 1
    return w, b

这里有几个要点：

主要的训练过程发生在第44-50行的内循环中，这也将是我们优化的重点。
第37-38行，每遍历数据集一次就检查权值向量 $\mathbf{w}$ 的更新情况，若其变化不大就提前跳出循环。
第41行，用“先shuffle再顺序遍历”的方法来模拟SGD每次迭代中对 $\mathbf{x}_i$ 的随机采样。

Step 1：明确标示所有变量的类型

首先，为了区别Cython文件与Python文件，我们要把文件名改为sgd.pyx。其实这时候直接就可以用Cython编译它了，因为Cython的语法和Python的语法是兼容的：

$ cython sgd.pyx

但是这样做呢程序的性能并不会提升多少，我们还需要加点其它的东西进去，来帮助cython生成更高效的C代码。

第一件事就是显示声明每个变量的类型。正如文章开始时说过的，Python是动态类型的语言，每个变量的类型是在运行时决定的，若不声明变量类型，则Cython默认每个变量都是PyObject类型的，这个类型在Cython中用来指代Python对象。Cython操作这种类型的变量时和Python解释器的手法是一样的，因此会慢很多。

#sgd.pyx
from __future__ import division
import numpy as np
cimport numpy as np
cimport cython
import numpy.random as random
DTYPE = np.double
ctypedef np.double_t DTYPE_t
TTYPE = np.int64
ctypedef np.int64_t TTYPE_t
@cython.cdivision(True)
@cython.boundscheck(False)
def sgd(np.ndarray[DTYPE_t, ndim=2] X,
        np.ndarray[TTYPE_t, ndim=1] Y,
        double ep):
    """ Train a simple perceptron using SGD
    Parameters
    ----------
    X: array of shape=(n_samples, n_features) with dtype=np.float64
        Each row contains the feature vector for a training sample instance.
    Y: array of shape=(n_samples, ) with dtype=np.int64 ranging in {0, 1}
        Corresponding label of training instances.
    ep: double
        Desired accuracy, which controls max iteration number.
    Returns
    -------
    w: array of shape=(n_features, ) with dtype=np.float64
        Weights vector of the linear classifier
    b: double
        Intercept term of the linear classifier
    """
    cdef int n_samples = X.shape[0]
    cdef int n_features = X.shape[1]
    cdef np.ndarray[DTYPE_t, ndim=1] w = random.randn(n_features)
    cdef np.ndarray[DTYPE_t, ndim=1] w_previous = w.copy() + 1
    cdef double b = 1.0
    cdef unsigned T = <unsigned>(1 / ep ** 2)
    cdef unsigned t = 0
    cdef long lx
    cdef unsigned j
    cdef double step_size
    cdef np.ndarray[np.uint32_t, ndim=1] idx = np.arange(n_samples, np.uint32)
    while t < T:
        # check stop criterion
        if np.linalg.norm(w_previous - w) <= 0.0001:
            break
        else:
            w_previous[:] = w
            random.shuffle(idx)
        # update weights vector and intercept
        for j in idx:
            lx = 2 * Y[j] - 1
            if (w.dot(X[j].T) + b) * lx < 0:
                step_size = 0.1 / (0.1 + t * ep ** 2)
                w += step_size * X[j] * lx
                b += step_size * lx
            t += 1
    return w, b

在Cython中，使用诸如cdef int x的语句来声明指定类型的变量，其中cdef是Cython关键字，int是类型（C类型或是通过cdef得到的扩展类型)，x是变量名。例子见下述代码中的第36、37行。
对于numpy.ndarray，Cython为之专门提供了用cdef重新封装过的版本（准确地说是用cdef定义了一个扩展类型）。为了使用该版本，需要用cimport把numpy重新导入一次（见第4行）。不过不用担心它和之前用import导入的numpy发生冲突，Cython会自动区分。cimport是Cython专用的语法，表示导入的是Cython模块（一般是一个.pxd文件）。
用cdef声明numpy.ndarray类型时，最好能够指出其中的数据类型以及矩阵的维度（例如函数参数中对X声明，见第15行）。多数时候这能大大提高编译后的代码中对矩阵的读写速度。不过注意这种优化方式仅当你使用“C-like indexing”（用D个typed variable索引维度为D的array，例如在本节代码中对Y的索引Y[j]）的时候才起作用，而若是使用Numpy的“fancy indexing”功能（例如在本节代码中对X每一行的索引X[j]）则起不到优化的效果。由于上述原因，这项优化在上述SGD代码中带来的性能提升并不明显。
numpy提供了各种数据类型，包括int32、float64等，分别对应C语言中的32位有符号整数、64位浮点数。C语言中的基本数据类型都有numpy的对应版本。但这些类型不能直接与cdef结合使用，因为在numpy中这些类型都是Python对象。通过cimport numpy这条语句，我们可以得到上述类型的C版本的定义，只需要在原来的类型名字末尾加上_t即可，例子见第9、11行。
numpy.ndarray支持负下标索引：类似Python中的list，-1表示最后一个元素，-2则是倒数第二个，依次类推。Cython默认开启对该功能的支持，而这也会减慢速度。关闭它的方法很简单：用unsigned int类型的变量进行矩阵索引，第46、48行声明的变量就是起这个作用。
Cython默认进行数组越界检查，这也是个降低速度的地方，不过可以用@cython.boundscheck(False)的方法关闭该功能（见第14行）

Step 2：添加对稀疏矩阵的支持

之前的程序默认输入的数据保存在numpy.ndarray对象中，这是一个稠密矩阵类型，在数据很稀疏时就不好用了。scipy.sparse提供了几种存储方式的稀疏矩阵，其底层其实都是用一个ndarray保存非零数据，再用几个ndarray保存数据的索引，唯一不同的只是构建索引的方式。最关键的是，这些底层的东西都可以通过某些办法直接操作的！

根据上面的思路，我模仿sklearn.utils.seq_dataset及sklearn.utils.weight_vector实现了两个容器SequentialDataset及WeightVector，但是针对自己的需要做了一些修改。其中SequentialDataset是用来操作训练数据集(X、Y)的，支持顺序存取及shuffle操作；WeightVector则是用来操作权值向量的，支持向量加法、内积、复制等操作。

由于篇幅所限，上述两个容器的代码实现不列出在这里，仅在后面做出简要说明，有兴趣同学可以去看sklearn中的源码[5]，毕竟思路是来自那里。下面列出修改后的sgd代码：

# cython: cdivision=True
# cython: boundscheck=False
# cython: wraparound=False
from __future__ import division
cimport cython
import numpy as np
cimport numpy as np
from .container cimport SequentialDataset
from .container cimport WeightVector
from .container cimport DTYPE_t, YTYPE_t
def sgd(SequentialDataset dataset,
        np.ndarray[DTYPE_t, ndim=1, mode='c'] weights,
        double intercept,
        double ep,
        bint shuffle, np.uint32_t seed):
    """Train a perceptron-like linear classifier using Stochastic Gradient
    Descent.
    Parameters
    ----------
    dataset: SequentialDataset,
        Training samples.
    weights: array of shape = [n_features] with dtype = np.DTYPE.
        Allocated weight vector.
    intercept: DTYPE,
        initial intercept value.
    ep: double.
        error tolerance, controls the iteration numbers. Training time will
        be roughly O(1/min_node_err^2).
    Returns
    -------
    weights : array, shape=[n_features]
        The fitted weight vector.
    intercept : double
        The fitted intercept term.
    """
    # acquire data information
    cdef Py_ssize_t n_samples = dataset.n_samples
    cdef Py_ssize_t n_features = weights.shape[0]
    cdef unsigned int n_iter = max(np.int(1 / ep ** 2), 1)
    # some helper variables:
    # for iteration
    cdef unsigned int n_outer = max(n_iter, n_samples)
    cdef unsigned int n_inner = min(n_iter, n_samples)
    cdef unsigned int t = 0
    cdef np.int64_t j = 0
    # for weight vector
    cdef np.ndarray[DTYPE_t, ndim=1] w_previous = weights.copy() + 1
    cdef WeightVector w = WeightVector(weights)
    cdef WeightVector wp = WeightVector(w_previous)
    cdef DTYPE_t* w_data_ptr = NULL
    cdef np.ndarray[int, ndim=1] w_indices = np.arange(n_features).astype(np.int32)
    cdef int* w_ind_ptr = <int *>w_indices.data
    cdef int wnnz = <int>n_features
    # for sample instance
    cdef DTYPE_t *x_data_ptr = NULL
    cdef int *x_ind_ptr = NULL
    cdef int xnnz
    cdef YTYPE_t y
    # for update
    cdef double step_size = 0
    cdef YTYPE_t lx = 0
    cdef double p = 0.0
    with nogil:
        while t < n_outer:
            if shuffle:
                dataset.shuffle(seed)
            for j in range(n_inner):
                # get next sample
                dataset.next(&x_data_ptr,
                             &x_ind_ptr,
                             &xnnz,
                             &y)
                lx = 2 * y - 1
                # update weights with hinge loss
                p = (w.dot(x_data_ptr, x_ind_ptr, xnnz) + intercept) * lx
                if p < 0:
                    step_size = lx * 0.1 / (0.1 + t * ep ** 2)
                    w.add(x_data_ptr, x_ind_ptr, xnnz, step_size)
                    intercept += step_size
                t += 1
            # check for early stopping
            w_data_ptr = w.w_data_ptr
            wp.add(w_data_ptr, w_ind_ptr, wnnz, -1)
            if t < n_iter and wp.norm() <= 0.0001:
                break
            else:
                wp.assign(w)
    return weights, intercept

注意在第73行，我使用with nogil显示地释放了GIL
其实在这里可以利用并行化来加速：在计算向量内积时使用openmp。（PS. 在Cython中使用openmp非常方便，具体见3。
第14行，参数中的dataset变量代替了原本的X、Y矩阵，其提供了shuffle操作（见第76行）及next操作（第79行）。
其中next函数的作用是返回数据集中下一个样本。注意它返回向量X[i]的方式：使用三个量来描述一个向量，分别是x_data_ptr、x_ind_ptr及xnnz，可以近似地认为x_data_ptr是保存X[i]中非零元素的数组，x_ind_ptr则保存了各个非零元在X[i]中的列标，xnnz则是非零元的个数。这种表示方式源自CSR稀疏矩阵[4]。如果你去看了SequentialDataset和WeightVector的实现的话，会发现它们根本就是C代码，各种指针运算漫天飞，而最新版本的Cython提供了typed memoryview机制，可以达到
同样的目的，但是安全许多，这里不展开讨论。
第55行，用WeightVector变量代替了原本的numpy.ndarray。其提供四个操作：add、dot、norm及assign。前两个函数的参数列表类似SequentialDataset.next，是为了配合与样本向量的计算。