In this paper we describe a parallel implementation of the artificial neural network (ANN) training procedure based on the block-mode back-propagation learning algorithm. Two different approaches to parallelizing the training were implemented. The first is data parallelization using POSIX threads, which is suitable for multi-core computers. The second is node parallelization using the high-performance SIMD architecture of a GPU with CUDA, which is suitable for CUDA-enabled computers. We compare the speedup of both approaches by training a typically sized network on a real-world phoneme-state classification task. The CUDA version gives a nearly 10-fold reduction in training time, while the multi-threaded version on an 8-core server gives only a 4-fold reduction. In both cases the baseline for comparison is an implementation that is already BLAS-optimized. The training tool will be released as open-source software under the project name TNet.
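The data-parallel idea mentioned above can be illustrated with a minimal sketch. It is not TNet code: the function names (`shard_gradient`, `block_gradient_parallel`, `sgd_step`), the single linear unit with squared error, and the learning rate are all assumptions chosen for brevity. The block of training examples is split into shards, each worker thread accumulates the gradient over its shard, and the partial gradients are summed before one weight update. Python threads are used here only to show the structure; because of the interpreter lock they would not reproduce the speedups reported for POSIX threads.

```python
from concurrent.futures import ThreadPoolExecutor


def shard_gradient(weights, shard):
    """Accumulate the squared-error gradient of one linear unit over a shard.

    Each example is a pair (inputs, target). Hypothetical toy model,
    standing in for a full back-propagation pass.
    """
    grad = [0.0] * len(weights)
    for x, target in shard:
        y = sum(w * xi for w, xi in zip(weights, x))
        err = y - target
        for i, xi in enumerate(x):
            grad[i] += err * xi
    return grad


def block_gradient_parallel(weights, block, n_workers=4):
    """Split the block into shards, compute partial gradients in threads, sum them."""
    shards = [block[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(lambda s: shard_gradient(weights, s), shards))
    # Element-wise sum of the per-shard gradients.
    return [sum(column) for column in zip(*partials)]


def sgd_step(weights, block, lr=0.1, n_workers=4):
    """One block-mode update: a single weight change per block of examples."""
    grad = block_gradient_parallel(weights, block, n_workers)
    return [w - lr * g / len(block) for w, g in zip(weights, grad)]
```

Because the gradient of a sum of per-example errors is the sum of the per-example gradients, the sharded result matches the serial one up to floating-point rounding, which is what makes block-mode training amenable to this kind of split.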
Bibliographic reference. Veselý, Karel / Burget, Lukáš / Grézl, František (2010): "Parallel training of neural networks for speech recognition", In INTERSPEECH-2010, 2934-2937.