Interspeech'2005 - Eurospeech
This paper presents a decoding method for automatic speech recognition (ASR) that reduces the impact of local spectral and temporal variability on ASR performance. The procedure augments the standard Viterbi search for an optimum state sequence with a locally constrained search for the optimum degree of spectral or temporal warping applied to each analysis frame. The paper argues that this is an efficient and effective method for compensating for local variability in speech, and that it may apply to a broader array of speech transformations. The techniques are presented in the context of existing methods for frequency-warping-based speaker normalization and for computing dynamic features for ASR. The modified decoding algorithms were evaluated under clean and noisy conditions using subsets of the Aurora 2 and Aurora 3 speech corpora. Under clean conditions on the Spanish-language subset of the SpeechDat-Car database, the modified decoding method applied with local frequency transformations reduced word error rate (WER) by 24 percent. This reduction was twice that obtained on the same task using the better-known frequency-warping-based vocal tract length normalization (VTLN) procedure.
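The idea of augmenting the Viterbi search with a per-frame warping variable can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no algorithmic details, so the code below assumes a discrete grid of warping factors, precomputed log-likelihoods for every (frame, state, warp) triple, a uniform prior over the initial warp, and a hypothetical `max_warp_step` parameter that limits how far the warp index may move between consecutive frames. All function and parameter names are illustrative.

```python
import numpy as np

def augmented_viterbi(log_emit, log_trans, log_init, max_warp_step=1):
    """Viterbi search over the joint (state, warp) space.

    log_emit:  (T, Q, A) log-likelihood of frame t in state q under warp index a
               (assumed to be precomputed by warping each frame A ways).
    log_trans: (Q, Q) log state-transition probabilities.
    log_init:  (Q,) log initial-state probabilities.
    Returns (best_score, state_path, warp_path).
    """
    T, Q, A = log_emit.shape
    # Local constraint (assumption): warp index may change by at most
    # max_warp_step between consecutive frames; allowed moves cost nothing.
    idx = np.arange(A)
    allowed = np.abs(idx[:, None] - idx[None, :]) <= max_warp_step
    log_warp = np.where(allowed, 0.0, -np.inf)          # (A_prev, A)

    delta = log_init[:, None] + log_emit[0]             # (Q, A)
    back = np.zeros((T, Q, A, 2), dtype=int)            # best (prev state, prev warp)
    for t in range(1, T):
        # cand[p, b, q, a]: score of arriving at (q, a) from (p, b)
        cand = (delta[:, :, None, None]
                + log_trans[:, None, :, None]
                + log_warp[None, :, None, :])
        flat = cand.reshape(Q * A, Q, A)
        best = flat.argmax(axis=0)                      # (Q, A), flat index p*A + b
        delta = np.take_along_axis(flat, best[None], axis=0)[0] + log_emit[t]
        back[t, :, :, 0], back[t, :, :, 1] = np.divmod(best, A)

    # Backtrace from the best final (state, warp) pair.
    q, a = np.unravel_index(delta.argmax(), delta.shape)
    score = float(delta[q, a])
    states, warps = [int(q)], [int(a)]
    for t in range(T - 1, 0, -1):
        q, a = back[t, q, a]
        states.append(int(q))
        warps.append(int(a))
    return score, states[::-1], warps[::-1]
```

In this sketch the search space grows from Q states to Q x A (state, warp) pairs, and the step constraint is what keeps the warping "local": a frame may be warped differently from its neighbour, but only by a bounded amount.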
Bibliographic reference. Miguel, Antonio / Lleida, Eduardo / Rose, Richard / Buera, Luis / Ortega, Alfonso (2005): "Augmented state space acoustic decoding for modeling local variability in speech", In INTERSPEECH-2005, 3009-3012.