We show empirically that in SGD training of deep neural networks, one can, at
no or nearly no loss of accuracy, quantize the gradients aggressively, to but
one bit per value, if the quantization error is carried forward
across minibatches (error feedback). This size reduction makes it feasible to
parallelize SGD through data-parallelism with fast processors like recent GPUs.
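The core idea above, quantizing each gradient value to a single bit while carrying the quantization residual forward into the next minibatch, can be sketched as follows. This is an illustrative sketch, not the paper's implementation: here the sign of each value supplies the bit, the mean magnitude of the positive and negative entries serves as the two reconstruction values, and the residual is accumulated as error feedback.

```python
import numpy as np

def one_bit_quantize(grad, error):
    """Quantize a gradient to 1 bit per value with error feedback.

    Hypothetical sketch: sign = the transmitted bit, reconstruction
    values = mean of the positive / negative entries, and the
    quantization residual is returned to be added to the next
    minibatch's gradient.
    """
    g = grad + error                        # fold in carried-over error
    pos = g >= 0                            # the 1-bit payload per value
    # two reconstruction values, one per bit state
    pos_val = g[pos].mean() if pos.any() else 0.0
    neg_val = g[~pos].mean() if (~pos).any() else 0.0
    quantized = np.where(pos, pos_val, neg_val)
    new_error = g - quantized              # residual fed forward (error feedback)
    return quantized, new_error
```

Because the residual is never discarded, every minibatch's quantization error eventually re-enters a later gradient, which is why the aggressive compression need not hurt convergence.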
We implement data-parallel deterministically distributed SGD by combining this finding with AdaGrad, automatic minibatch-size selection, double buffering, and model parallelism. Unexpectedly, quantization benefits AdaGrad, giving a small accuracy gain.
For a typical Switchboard DNN with 46M parameters, we reach computation speeds of 27k frames per second (kfps) with 2880 samples per minibatch, and 51 kfps with a 16k-sample minibatch, on a server with 8 K20X GPUs. This corresponds to speed-ups over a single GPU of 3.6 and 6.3, respectively. Seven training passes over 309h of data complete in under 7h. A 160M-parameter model training processes 3300h of data in under 16h on 20 dual-GPU servers, a 10-times speed-up, albeit at a small accuracy loss.
Bibliographic reference. Seide, Frank / Fu, Hao / Droppo, Jasha / Li, Gang / Yu, Dong (2014): "1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs", In INTERSPEECH-2014, 1058-1062.