Time-Delay Neural Networks (TDNNs) were shown by Waibel et al. to be an effective method for classifying dynamic speech sounds such as voiced stop consonants. In this paper we discuss key issues in the design and training of a TDNN, based on a Multi-Layer Perceptron (MLP), for classifying the voiced stop consonants (/b/, /d/, and /g/) and the unvoiced stop consonants (/p/, /t/, and /k/) from the TIMIT database. We show that transforming each input parameter of the TDNN to have zero mean and unit variance (separately for each phoneme class) greatly improves overall classification performance. The resulting TDNN classification accuracy for either voiced or unvoiced stop consonants is around 91%. This performance is achieved without any specific discriminative spectral measurements, and the approach can be applied directly to the classification of any of the dynamic phoneme classes.
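The input normalization described above can be illustrated with a minimal sketch. It assumes each token is represented by a fixed-length vector of input parameters and that the mean and standard deviation are estimated per parameter within each group of training vectors. The function names, the grouping dictionary, and the use of the population standard deviation are illustrative assumptions, not details taken from the paper.

```python
from statistics import fmean, pstdev


def fit_standardizer(vectors):
    """Estimate per-parameter means and standard deviations.

    `vectors` is a list of equal-length feature vectors (hypothetical layout).
    """
    columns = list(zip(*vectors))  # one tuple of values per input parameter
    means = [fmean(col) for col in columns]
    # A constant parameter has zero spread; fall back to 1.0 to avoid
    # dividing by zero (an assumption made for this sketch).
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def standardize(vectors, means, stds):
    """Shift and scale each parameter to zero mean and unit variance."""
    return [[(x - m) / s for x, m, s in zip(vec, means, stds)]
            for vec in vectors]


def standardize_by_group(groups):
    """Standardize each group of vectors with its own statistics.

    `groups` maps a label (assumed here to be a phoneme class) to the
    list of feature vectors belonging to that label.
    """
    result = {}
    for label, vectors in groups.items():
        means, stds = fit_standardizer(vectors)
        result[label] = standardize(vectors, means, stds)
    return result
```

Calling `standardize_by_group({"b": [[1.0, 10.0], [3.0, 30.0]]})` rescales both parameters of the `"b"` group so that each has zero mean and unit variance within that group. How the statistics were estimated and applied in the actual experiments is not specified in this abstract.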
Bibliographic reference. Hou, Jun / Rabiner, Lawrence R. / Dusan, Sorin (2007): "On the use of time-delay neural networks for highly accurate classification of stop consonants", In INTERSPEECH-2007, 1929-1932.