We introduce a new method for scaling up distributed Stochastic Gradient Descent (SGD) training of Deep Neural Networks (DNN). The method solves the well-known communication bottleneck problem that arises for data-parallel SGD because compute nodes frequently need to synchronize a replica of the model. We solve it by purposefully controlling the rate of weight-update per individual weight, which is in contrast to the uniform update-rate customarily imposed by the size of a mini-batch. It is shown empirically that the method can reduce the amount of communication by three orders of magnitude while training a typical DNN for acoustic modelling. This reduction in communication bandwidth enables efficient scaling to more parallel GPU nodes than any other method that we are aware of, and it can be achieved with neither loss in convergence rate nor accuracy in the resulting DNN. Furthermore, the training can be performed on commodity cloud infrastructure and networking.
Bibliographic reference. Strom, Nikko (2015): "Scalable distributed DNN training using commodity GPU cloud computing", In INTERSPEECH-2015, 1488-1492.