@songying 2018-08-03T01:21:10.000000Z Words: 1293 Views: 1042

Highway Networks

word-embedding

Abstract

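As neural networks get deeper, they become increasingly difficult to train. This paper introduces a new architecture designed to ease gradient-based training of very deep networks. The architecture uses a gating mechanism that lets some information flow through several layers without attenuation, and the resulting networks can be trained with SGD.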

Introduction

In this extended abstract, we present a novel architecture that enables the optimization of networks with virtually arbitrary depth. This is accomplished through the use of a learned gating mechanism for regulating information flow which is inspired by Long Short Term Memory recurrent neural networks. Due to this gating mechanism, a neural network can have paths along which information can flow across several layers without attenuation.

2. Highway Networks

For a plain feedforward neural network, a single layer computes:

$y = H(x, W_H)$
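where $H$ is usually a non-linear transformation of $x$ (in the paper, an affine transform followed by a non-linear activation function) parameterized by $W_H$.
In a highway network, we add two further non-linear transforms: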

$y = H(x, W_H) \cdot T(x, W_T) + x \cdot C(x, W_C)$
where $T$ is called the transform gate and $C$ the carry gate. For simplicity, we set

$C = 1-T$, which gives:

$y = H(x, W_H) \cdot T(x, W_T) + x \cdot (1 - T(x, W_T))$
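The formula with $C = 1 - T$ can be sketched as a forward pass in plain Python. This is only an illustration under stated assumptions: $H$ is taken to be `tanh` of an affine map and $T$ a sigmoid of an affine map, and the helper names (`affine`, `highway_layer`), the bias terms, and the toy weights are hypothetical and do not come from the note.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, x):
    """Return W @ x + b, with W given as a list of rows."""
    return [sum(w * x_j for w, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def highway_layer(x, W_H, b_H, W_T, b_T):
    """Compute y = H(x) * T(x) + x * (1 - T(x)) elementwise.

    Assumptions (not from the note): H = tanh(affine), T = sigmoid(affine),
    and square weight matrices so that x, H, T and y share one dimension.
    """
    h = [math.tanh(z) for z in affine(W_H, b_H, x)]
    t = [sigmoid(z) for z in affine(W_T, b_T, x)]
    return [h_i * t_i + x_i * (1.0 - t_i) for h_i, t_i, x_i in zip(h, t, x)]

# Toy values, chosen only for illustration.
x = [0.5, -1.0]
W_H = [[0.2, -0.4], [0.7, 0.1]]
b_H = [0.0, 0.0]
W_T = [[0.0, 0.0], [0.0, 0.0]]

# Zero gate weights and zero bias give T = 0.5: an even mix of H(x) and x.
y_mixed = highway_layer(x, W_H, b_H, W_T, [0.0, 0.0])

# A strongly negative gate bias pushes T towards 0, so the layer carries x.
y_carry = highway_layer(x, W_H, b_H, W_T, [-50.0, -50.0])
```

With the gate near zero the output is almost exactly the input, which is the "information flows without attenuation" behaviour described in the introduction.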

Note that $x, y, H(x, W_H), T(x, W_T)$ must all have the same dimensionality, with zero-padding where a dimension falls short. In the two extreme cases of the gate, the layer reduces to:

$y = \begin{cases} x, & \text{if } T(x, W_T) = 0 \\ H(x, W_H), & \text{if } T(x, W_T) = 1 \end{cases}$
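The two limiting cases can be checked directly by supplying the gate values by hand instead of computing them. The function name `highway_combine` and the sample vectors below are hypothetical; the length check reflects the requirement that all quantities share one dimensionality.

```python
def highway_combine(h, x, t):
    """Mix transformed values h with inputs x using gate values t in [0, 1]."""
    if not (len(h) == len(x) == len(t)):
        raise ValueError("h, x and t must have the same dimensionality")
    return [h_i * t_i + x_i * (1.0 - t_i) for h_i, x_i, t_i in zip(h, x, t)]

x = [1.0, 2.0, 3.0]    # layer input
h = [-0.5, 0.0, 0.5]   # stand-in for H(x, W_H)

y_closed = highway_combine(h, x, [0.0, 0.0, 0.0])  # T = 0: input is carried
y_open = highway_combine(h, x, [1.0, 1.0, 1.0])    # T = 1: plain layer output
y_mixed = highway_combine(h, x, [0.0, 1.0, 0.5])   # gate set per component
```

Because the gate acts elementwise, each component can sit anywhere between the two extremes independently of the others.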

Correspondingly, the Jacobian of the layer is:

$\frac{dy}{dx} = \begin{cases} I, & \text{if } T(x, W_T) = 0 \\ H'(x, W_H), & \text{if } T(x, W_T) = 1 \end{cases}$
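The Jacobian cases can be sanity-checked numerically on a single scalar unit. In this sketch the gate value is held fixed at 0 or 1 (treated as a constant rather than a function of $x$), $H$ is assumed to be `tanh(w * x + b)`, and the names `highway_unit` and `numeric_derivative` and the constants `w`, `b`, `x0` are all hypothetical.

```python
import math

def highway_unit(x, t, w=0.8, b=0.1):
    """Scalar highway unit with a fixed gate value t and H(x) = tanh(w*x + b)."""
    h = math.tanh(w * x + b)
    return h * t + x * (1.0 - t)

def numeric_derivative(f, x, eps=1e-6):
    """Central finite-difference estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

x0 = 0.3

# Gate closed (T = 0): the unit is the identity, so dy/dx should be 1.
dy_carry = numeric_derivative(lambda x: highway_unit(x, 0.0), x0)

# Gate open (T = 1): the unit is H itself, so dy/dx should equal H'(x0).
dy_transform = numeric_derivative(lambda x: highway_unit(x, 1.0), x0)

# Analytic H'(x0) for H(x) = tanh(w*x + b) with w = 0.8, b = 0.1.
h_prime = 0.8 * (1.0 - math.tanh(0.8 * x0 + 0.1) ** 2)
```

In the scalar case the identity matrix $I$ is just 1, so `dy_carry` should be close to 1 and `dy_transform` close to `h_prime`, up to finite-difference error.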

Thus, depending on the output of the transform gates, a highway layer can smoothly vary its behavior between that of a plain layer and that of a layer which simply passes its inputs through.

Written per unit, where the subscript $i$ indexes the $i$-th component of each vector:

$y_i = H_i(x) \cdot T_i(x) + x_i \cdot (1 - T_i(x))$
