Sixth European Conference on Speech Communication and Technology
Features for automatic speech recognition (ASR) are typically sampled at about 100 Hz (10 ms analysis step). Recent experiments indicate that the components of the modulation spectrum of speech that are most useful for ASR lie below about 16 Hz. Consequently, RASTA processing, which attenuates modulation frequencies higher than 16 Hz, should in principle allow for a subsequent down-sampling of the features. It has been shown earlier that in a Gaussian mixture model based speaker recognition system (which uses a single-state HMM and thus requires no time alignment of the incoming speech) one can down-sample the speech representation after RASTA filtering without any significant loss of performance. However, since ASR uses Viterbi time alignment, the reduced number of time samples due to down-sampling, although justified by the Nyquist criterion after the low-pass filtering, could create problems. In this paper we show experimentally that down-sampling of features after RASTA filtering is feasible and could result in considerable computational, or at least storage and transmission, savings.
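The Nyquist argument above can be illustrated with a minimal sketch. It is not the authors' implementation: a generic windowed-sinc low-pass FIR stands in for the RASTA filter, and the function names, the 51-tap filter length, and the down-sampling factor of 3 are assumptions chosen for illustration. With features at 100 Hz and modulation content limited to about 16 Hz, a factor of 3 gives a rate of about 33.3 Hz, whose Nyquist frequency (about 16.7 Hz) still covers the retained band, whereas a factor of 4 (Nyquist 12.5 Hz) would not.

```python
import numpy as np

def lowpass_fir(cutoff_hz, fs_hz, num_taps=51):
    """Windowed-sinc low-pass FIR; a stand-in for RASTA, not the RASTA filter."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    fc = cutoff_hz / fs_hz                      # cutoff as a fraction of the sampling rate
    h = 2 * fc * np.sinc(2 * fc * n)            # ideal low-pass impulse response
    h *= np.hamming(num_taps)                   # taper to reduce ripple
    return h / h.sum()                          # unity gain at 0 Hz

def downsample_features(features, factor, fs_hz=100.0, cutoff_hz=16.0):
    """Low-pass each feature trajectory, then keep every `factor`-th frame.

    features: array of shape (num_frames, num_dims), one row per 10 ms frame.
    Returns (down-sampled features, new frame rate in Hz).
    """
    new_rate = fs_hz / factor
    if cutoff_hz > new_rate / 2:
        raise ValueError("cutoff exceeds the Nyquist frequency of the new rate")
    h = lowpass_fir(cutoff_hz, fs_hz)
    # Filter along the time axis, independently for each feature dimension.
    filtered = np.apply_along_axis(
        lambda x: np.convolve(x, h, mode="same"), 0, features
    )
    return filtered[::factor], new_rate
```

For example, calling `downsample_features` on 300 frames of 13-dimensional features with `factor=3` keeps 100 frames, cutting storage and per-frame computation to roughly a third. Whether recognition accuracy survives this is the empirical question the paper addresses.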
Bibliographic reference. Hermansky, Hynek / Jain, Pratibha (1999): "Down-sampling speech representation in ASR", In EUROSPEECH'99, 73-76.