We present a feature extraction technique for automatic speech recognition (ASR) that uses a Tandem representation of short-term spectral envelope and modulation frequency features. Both feature streams are derived from sub-band temporal envelopes of speech, estimated using frequency domain linear prediction, and are combined at the phoneme posterior level. Tandem representations derived from these phoneme posteriors are then used with HMM-based ASR systems for both small vocabulary tasks and large vocabulary continuous speech recognition (LVCSR) tasks. On a small vocabulary continuous digit task using the OGI Digits database, the proposed features yield a relative word error rate (WER) reduction of 13% compared to other feature extraction techniques. On an LVCSR task using the NIST RT05 evaluation data, we obtain a relative WER reduction of about 14%. For phoneme recognition on the TIMIT database, these features provide a relative improvement of 13% over other techniques.
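As an illustration of combining two feature streams at the phoneme posterior level and deriving a Tandem representation, the following is a minimal sketch under stated assumptions, not the authors' implementation. The abstract does not specify the combination rule, so simple averaging is assumed here. The log compression followed by PCA decorrelation is the commonly used Tandem recipe. The function names, the 40 phoneme classes, the 25 output dimensions, and the random posteriors (standing in for classifier outputs) are all hypothetical.

```python
import numpy as np

def combine_posteriors(p_a, p_b):
    """Average two (frames x phonemes) posterior matrices and renormalize rows.

    Averaging is an assumed combination rule, chosen only for illustration.
    """
    p = 0.5 * (p_a + p_b)
    return p / p.sum(axis=1, keepdims=True)

def tandem_features(posteriors, n_components, eps=1e-10):
    """Log-compress posteriors, then decorrelate and reduce them with PCA."""
    log_p = np.log(posteriors + eps)          # make the distribution less skewed
    centered = log_p - log_p.mean(axis=0)     # remove the per-dimension mean
    cov = np.cov(centered, rowvar=False)      # (phonemes x phonemes) covariance
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]  # keep the largest ones
    return centered @ eigvecs[:, order]

# Hypothetical stand-ins for the posteriors of the two feature streams.
rng = np.random.default_rng(0)

def random_posteriors(n_frames, n_phonemes):
    x = rng.random((n_frames, n_phonemes))
    return x / x.sum(axis=1, keepdims=True)

p_spectral = random_posteriors(200, 40)
p_modulation = random_posteriors(200, 40)

combined = combine_posteriors(p_spectral, p_modulation)
features = tandem_features(combined, n_components=25)
print(features.shape)  # (200, 25)
```

In a Tandem setup, the resulting decorrelated features would then serve as the observation vectors for a conventional HMM-based recognizer.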
Bibliographic reference. Thomas, Samuel / Ganapathy, Sriram / Hermansky, Hynek (2009): "Tandem representations of spectral envelope and modulation frequency features for ASR", In INTERSPEECH-2009, 2955-2958.