Avoid using Sigmoid
Machine Learning
Sigmoid used to be a popular activation function, but it should no longer be used:
- Saturated neurons "kill" the gradients.
For example, with σ(x) = 1/(1 + e^(−x)) the local gradient is ∂σ/∂x = σ(x)(1 − σ(x)), which at x = 10 is ≈ 4.5e−5, so almost no gradient flows back through the neuron (see the sketch after this list).
- Sigmoid outputs are not zero-centered.
If the inputs to a neuron are always positive, then the gradients on w are either all positive or all negative (they all share the sign of the upstream gradient), which leads to inefficient zig-zagging updates.
- The exp() function is somewhat expensive to compute.
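A minimal NumPy sketch (the function names are my own) illustrating the first two points: the local gradient at x = 10 is roughly 4.5e−5, and every sigmoid output is strictly positive.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # dσ/dx = σ(x) * (1 - σ(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# Saturation: the local gradient is tiny for large |x|,
# so upstream gradients are "killed".
print(sigmoid_grad(10.0))   # ~4.5e-05
print(sigmoid_grad(-10.0))  # ~4.5e-05

# Not zero-centered: every output lies in (0, 1),
# so the inputs to the next layer are always positive.
x = np.linspace(-5.0, 5.0, 11)
print(sigmoid(x).min() > 0)  # True
```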
tanh squashes numbers to the range [−1, 1]. It is zero-centered (nice), but it still kills gradients when it saturates.
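A quick check of both properties, as a sketch: the outputs follow the sign of the input, but the local gradient 1 − tanh²(x) still vanishes for large |x|.

```python
import numpy as np

# Zero-centered: outputs lie in (-1, 1) and take the sign of the input.
x = np.linspace(-5.0, 5.0, 11)
print(np.tanh(x))

# Still saturates: d tanh(x)/dx = 1 - tanh(x)^2 is ~8.2e-09 at x = 10,
# so gradients are killed just like with sigmoid.
def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2

print(tanh_grad(10.0))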
ReLU, f(x) = max(0, x), is a much better choice: it does not saturate in the positive region and is very cheap to compute.
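A small sketch of ReLU and its gradient (function names are mine): the local gradient is 1 for x > 0, so gradients are not squashed there, and the forward pass is just a comparison with no exp().

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 for x > 0 and 0 for x < 0:
    # no saturation in the positive region.
    return (x > 0).astype(x.dtype)

x = np.array([-10.0, -1.0, 0.5, 10.0])
print(relu(x))       # [ 0.   0.   0.5 10. ]
print(relu_grad(x))  # [0. 0. 1. 1.]
```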