This paper describes a speaker-independent 212-word recognition method that uses dynamic and averaged features of the speech spectrum, derived from the two-dimensional mel-cepstrum (TDMC) of a spoken word, together with the large-scale neural network "CombNET-II". TDMC is defined as the two-dimensional Fourier transform of mel-frequency-scaled logarithmic spectra in the frequency and time domains. CombNET-II is a four-layered neural network with a comb structure, consisting of two parts. The first part roughly classifies an input pattern into a category group, and the second part precisely classifies the input pattern into a specific category. A speaker-independent word recognition experiment is carried out on 212 Japanese words uttered by 10 male speakers, with dynamic and averaged features based on TDMC used as the input pattern of CombNET-II. A recognition accuracy of 95.5% is obtained. The method reduces the amount of calculation to about 1/6 of that required by a k-NN classifier.
Keywords: word recognition, neural network, spectral features of speech
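The TDMC definition in the abstract can be illustrated with a minimal sketch. It assumes the mel-frequency-scaled log spectra are already available as a matrix with one row per frame and one column per mel band. The function names `tdmc` and `split_features`, the matrix sizes, the use of the forward DFT, and the choice of which coefficients to keep are illustrative assumptions, not the authors' implementation; the abstract does not specify the exact feature selection.

```python
import numpy as np

def tdmc(log_mel_spec):
    """Two-dimensional DFT of a (frames, bands) mel log-spectrum matrix.

    Axis 0 is time, axis 1 is mel frequency, so the transform runs over
    both the time and the frequency domains.
    """
    spec = np.asarray(log_mel_spec, dtype=float)
    return np.fft.fft2(spec)

def split_features(coeffs, n_time, n_quef):
    """Hypothetical feature selection (an assumption, not from the paper).

    Row 0 (zero temporal frequency) is the sum over time of the per-frame
    transforms, so dividing by the frame count gives a time-averaged
    feature. The next n_time rows describe slow variation over time and
    are treated here as the dynamic features.
    """
    n_frames = coeffs.shape[0]
    averaged = coeffs[0, :n_quef] / n_frames
    dynamic = coeffs[1:n_time + 1, :n_quef]
    return averaged, dynamic

# Synthetic stand-in for one word: 40 frames, 16 mel bands (assumed sizes).
rng = np.random.default_rng(0)
spec = rng.normal(size=(40, 16))

coeffs = tdmc(spec)
averaged, dynamic = split_features(coeffs, n_time=3, n_quef=8)
```

In this sketch the averaged feature coincides with the one-dimensional transform of the time-averaged log spectrum, which follows from the linearity of the DFT. A fixed-size vector built from `averaged` and `dynamic` could then serve as a classifier input regardless of word duration.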
Bibliographic reference. Sasaki, Taro / Kitamura, Tadashi / Iwata, Akira (1993): "Speaker-independent 212 word recognition using combNET-II", In EUROSPEECH'93, 1013-1016.