@nrailgun 2016-10-31T12:24:28.000000Z

MXNet

MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems

Powerful Software

MXNet is a lightweight, portable, flexible, distributed machine learning library that eases the development of ML algorithms, especially deep neural networks. MXNet is computation- and memory-efficient and runs on various heterogeneous systems.

1. Introduction

The scale and complexity of machine learning algorithms are becoming increasingly large. Almost all recent ImageNet challenge winners employ very deep neural networks, which require billions of floating-point operations to process a single sample. This rise in computational complexity poses interesting challenges for ML system design and implementation.

How the computation is carried out:

concrete: result is returned right away on the same thread,
asynchronous: statements are first gathered and transformed into a dataflow graph as an intermediate representation, before being released to available devices.
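The difference between the two execution styles can be illustrated with a small sketch. This is not MXNet code: the `Node` class and the `const` and `concrete_add` helpers are hypothetical names made up for the example, and the "graph" is evaluated on the calling thread rather than being handed to a scheduler or device.

```python
# Toy contrast between concrete and deferred execution; not MXNet code.

def concrete_add(a, b):
    """Concrete execution: the sum is computed and returned immediately."""
    return a + b


class Node:
    """Hypothetical graph node: records an operation instead of running it."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = inputs
        self.value = value

    def __add__(self, other):
        return Node("add", (self, other))

    def __mul__(self, other):
        return Node("mul", (self, other))

    def evaluate(self):
        """Walk the recorded graph and compute the result on demand."""
        if self.op == "const":
            return self.value
        left, right = (node.evaluate() for node in self.inputs)
        return left + right if self.op == "add" else left * right


def const(value):
    """Wrap a plain number as a leaf node of the graph."""
    return Node("const", value=value)


eager = concrete_add(2, 3)                 # computed right away
graph = (const(2) + const(3)) * const(3)   # only builds the graph
deferred = graph.evaluate()                # computation happens here
```

In the deferred case the whole expression is known before anything runs, which is what gives a backend room to reorder or distribute the work.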

Comparison with other popular open-source ML libraries:

System	Core language	Devices	Distributed
Caffe	C++	CPU / GPU
Torch	Lua	CPU / GPU / FPGA
TensorFlow	C++	CPU / GPU	$\checkmark$
MXNet	C++	CPU / GPU	$\checkmark$

2. Programming Interface

2.1 Symbol: Declarative Symbolic Expressions

2.2 NDArray: Imperative Tensor Computation

2.3 KVStore: Data Synchronization Over Devices

The KVStore is a distributed key-value store for data synchronization over multiple devices (machines, GPUs). It supports 2 primitives:

push a key-value pair from a device to the store,
pull the value for a key from the store.

Model divergence is controlled via a consistency model. Currently, sequential and eventual consistency are supported.

The following example implements distributed gradient descent with data parallelism.

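while (1) {
    kv.pull(net.w);          // fetch the newest weights from the store
    net.forward_backward();  // compute gradients on the local data
    kv.push(net.g);          // send the local gradients to the store
}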

Here the weight-update function is registered with the KVStore; each worker repeatedly pulls the newest weights from the store and then pushes its locally computed gradients.

The mixed implementation above performs as well as a single declarative program, because the actual data pushes and pulls are evaluated lazily and scheduled by the backend engine just like any other operation.
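To make the pull, compute, push loop of 2.3 concrete, here is a minimal single-process sketch. It is an assumption-laden toy, not the MXNet API: `ToyKVStore`, `sgd_updater`, and `local_gradient` are hypothetical names, the "workers" are just data shards visited in turn (roughly what sequential consistency would look like), and the model is a one-parameter fit of y = 2x.

```python
# Single-process toy of KVStore-style data-parallel gradient descent.
# All names here are hypothetical; this is not the MXNet KVStore API.

class ToyKVStore:
    """In-memory key-value store with a registered update function."""

    def __init__(self, updater):
        self._store = {}
        self._updater = updater  # how a pushed value is merged into the store

    def init(self, key, value):
        self._store[key] = list(value)

    def push(self, key, grad):
        # The registered updater turns a pushed gradient into a weight update.
        self._store[key] = self._updater(self._store[key], grad)

    def pull(self, key):
        return list(self._store[key])


def sgd_updater(weight, grad, lr=0.05):
    """Plain SGD step applied inside the store."""
    return [w - lr * g for w, g in zip(weight, grad)]


def local_gradient(weight, shard):
    """Mean gradient of 0.5 * (w * x - y) ** 2 over one worker's data shard."""
    w = weight[0]
    return [sum((w * x - y) * x for x, y in shard) / len(shard)]


# Two workers, each holding a different shard of (x, y) pairs with y = 2x.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]

kv = ToyKVStore(sgd_updater)
kv.init("w", [0.0])

for _ in range(200):
    for shard in shards:                 # each worker takes a turn
        w = kv.pull("w")                 # pull the newest weight
        g = local_gradient(w, shard)     # forward/backward on local data
        kv.push("w", g)                  # push the local gradient

final_w = kv.pull("w")[0]
```

Because the update rule lives in the store, the worker loop never touches the weights directly; it only pulls and pushes, which mirrors the shape of the pseudocode in this section.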

3. Implementation

3.1 Computation Graph

3.2 Dependency Engine

3.3 Data communication

We implemented KVStore based on the parameter server. It differs from previous work in two aspects. First, we use the engine to schedule the KVStore operations and manage data consistency. Second, we adopt a two-level structure: a level-1 server manages data synchronization between the devices within a single machine, while a level-2 server manages inter-machine synchronization.
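The two-level structure can be pictured as two rounds of aggregation. The sketch below is only an illustration under simple assumptions: gradients are plain lists of floats, aggregation is a sum, and `level1_aggregate`, `level2_aggregate`, and the `cluster` layout are hypothetical names, not part of MXNet.

```python
# Toy two-level gradient aggregation; names and layout are hypothetical.

def level1_aggregate(device_grads):
    """Level 1: sum the gradients from all devices within one machine."""
    return [sum(values) for values in zip(*device_grads)]


def level2_aggregate(machine_grads):
    """Level 2: sum the per-machine results across machines."""
    return [sum(values) for values in zip(*machine_grads)]


# Two machines with two devices each; every device holds a 2-element gradient.
cluster = {
    "machine0": [[1.0, 2.0], [3.0, 4.0]],
    "machine1": [[5.0, 6.0], [7.0, 8.0]],
}

# Each machine first combines its own devices' gradients locally...
per_machine = [level1_aggregate(grads) for grads in cluster.values()]

# ...so only one gradient per machine crosses the network to level 2.
total = level2_aggregate(per_machine)
```

In this toy setup four device gradients shrink to two inter-machine messages, which suggests why combining locally before communicating across machines can be attractive.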