It is known that rectified linear deep neural networks (RL-DNNs) can consistently outperform conventional pre-trained sigmoid DNNs even with random initialization. In this paper, we present another interesting and useful property of RL-DNNs: they can be trained with very large batch sizes in stochastic gradient descent (SGD). As a result, SGD learning can easily be parallelized across multiple computing units for much better training efficiency. Moreover, we propose a tied-scalar regularization technique to make large-batch SGD learning of RL-DNNs more stable. Experimental results on the 309-hour Switchboard (SWB) task show that we can train RL-DNNs using batch sizes about 100 times larger than those used in previous work, so the learning of RL-DNNs can be accelerated by more than 10 times when 8 GPUs are used. More importantly, we have achieved a word error rate of 13.8% with a 6-hidden-layer RL-DNN trained with the frame-level cross-entropy criterion and tied-scalar regularization. To our knowledge, this is the best reported performance on this task under the same experimental settings.
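The abstract names but does not define tied-scalar regularization. A minimal sketch of one plausible reading, assuming the idea is to keep each hidden unit's incoming weight vector at unit norm and let a single trainable scalar, tied across the whole layer, carry the overall scale (the function name, the per-unit normalization, and the placement of the scalar are all assumptions, not the authors' exact formulation):

```python
import numpy as np

def tied_scalar_relu_layer(x, W, b, alpha):
    """Forward pass through one ReLU layer with a tied scalar.

    Hypothetical formulation: each hidden unit's incoming weight
    vector is normalized to unit L2 norm, and a single scalar
    `alpha`, shared (tied) across the layer, sets the scale. This
    bounds the effective weight magnitudes, which is assumed here
    to be what stabilizes large-batch SGD.
    """
    # Normalize each row of W (one row per hidden unit) to unit L2 norm.
    W_hat = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-8)
    # One scalar per layer scales the whole normalized linear map.
    return np.maximum(0.0, alpha * (x @ W_hat.T) + b)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 10))   # batch of 4 input vectors
W = rng.standard_normal((6, 10))   # 6 hidden units
b = np.zeros(6)
h = tied_scalar_relu_layer(x, W, b, alpha=2.0)
print(h.shape)  # (4, 6)
```

Under this reading, SGD would update `alpha` (and `b`) freely while the direction of each weight vector is learned separately from its magnitude, so a large batch cannot blow up the layer's effective weight norm in a single step.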
Bibliographic reference. Zhang, Shiliang / Jiang, Hui / Wei, Si / Dai, Li-Rong (2015): "Rectified linear neural networks with tied-scalar regularization for LVCSR", In INTERSPEECH-2015, 2635-2639.