Short-term temporal information in speech is widely used in automatic speech recognition (ASR) systems in the form of dynamic features. Long-term temporal information has also attracted attention recently and is used to complement traditional short-term features (typically from 25 to 100 ms). There are several approaches to representing long-term temporal information in ASR systems. However, those systems use high-dimensional feature spaces to capture it. This paper describes an attempt to incorporate long-term temporal information into a feature parameter set by combining conventional dynamic features extracted from both short- and long-term cepstrum sequences. The proposed method captures the temporal contexts of phonemes with long-term features and the spectral variations within phonemes with short-term features. In an experiment on the realistic speech corpus CENSREC-2, the proposed method yielded higher performance than a standard feature parameter set consisting of static mel-frequency cepstral coefficients (MFCCs) and their short-term dynamic features.
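The idea of combining dynamic features from short- and long-term cepstrum sequences can be illustrated with a minimal sketch. It assumes the conventional regression-based delta computation and simply applies it twice with different window lengths. The function names (`delta`, `short_long_features`) and the half-window sizes (2 and 10 frames) are hypothetical choices for illustration and are not taken from the paper, whose exact configuration is not given in this abstract.

```python
import numpy as np


def delta(cepstra, half_window):
    """Regression-based delta over +/- half_window frames.

    cepstra: array of shape (num_frames, num_coeffs).
    Edge frames are replicated so the output keeps the input shape.
    """
    num_frames = cepstra.shape[0]
    padded = np.pad(cepstra, ((half_window, half_window), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, half_window + 1))
    out = np.zeros_like(cepstra, dtype=float)
    for n in range(1, half_window + 1):
        ahead = padded[half_window + n : half_window + n + num_frames]
        behind = padded[half_window - n : half_window - n + num_frames]
        out += n * (ahead - behind)
    return out / denom


def short_long_features(cepstra, short_half=2, long_half=10):
    """Stack static cepstra with short- and long-window deltas.

    The window sizes are illustrative assumptions, not the paper's values.
    """
    return np.hstack(
        [cepstra, delta(cepstra, short_half), delta(cepstra, long_half)]
    )
```

With 13 static coefficients per frame, this sketch yields a 39-dimensional vector per frame: the short-window delta responds to spectral change within a phoneme, while the long-window delta responds to slower change across neighbouring phonemes.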
Bibliographic reference. Fukuda, Takashi / Ichikawa, Osamu / Nishimura, Masafumi (2008): "Short- and long-term dynamic features for robust speech recognition", In INTERSPEECH-2008, 2262-2265.