Current frame-based speech recognition systems sample speech at a fixed set of locations relative to each frame, which complicates modeling the temporal dynamics of speech. This work shows that explicitly using transitional information during feature extraction yields a better model of the acoustic-phonetic structure and, as a result, higher word-level recognition performance. In the proposed approach, the features represent local transitional information: a constant number of features is selected at each time frame, but the features are sampled near the areas of greatest spectral change within a relatively long window. Explicitly modeling transitions in this way also captures local contextual information. Using this technique, the word-level error rate decreased by up to 30% on the databases we tested.
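The sampling idea can be illustrated with a minimal sketch. The abstract does not give the change measure, the window length, or the number of sampling points, so everything below is an assumption made for illustration: spectral change is taken as the Euclidean distance between neighbouring spectral frames, and the hypothetical functions `spectral_change` and `transition_features` (with hypothetical parameters `half_window` and `n_points`) pick the frames of greatest change around each time frame. It is not the authors' implementation.

```python
import numpy as np

def spectral_change(spec):
    """Per-frame spectral change (assumed measure: Euclidean distance
    between each frame and the frame before it). spec has shape (T, F)."""
    step = np.linalg.norm(np.diff(spec, axis=0), axis=1)  # shape (T-1,)
    # Frame 0 has no predecessor, so its change is set to 0.
    return np.concatenate(([0.0], step))

def transition_features(spec, half_window=10, n_points=3):
    """For every frame, sample the spectrum at the n_points frames of
    greatest spectral change inside a window of +/- half_window frames.
    Returns an array of shape (T, n_points * F): the number of features
    per frame is constant, but the sampling locations vary."""
    T, F = spec.shape
    if n_points > min(T, half_window + 1):
        raise ValueError("n_points exceeds the smallest possible window")
    change = spectral_change(spec)
    feats = np.empty((T, n_points * F))
    for t in range(T):
        lo = max(0, t - half_window)
        hi = min(T, t + half_window + 1)
        # Positions of the n_points largest changes inside the window.
        top = np.argsort(change[lo:hi], kind="stable")[-n_points:]
        # Restore temporal order so the features keep a consistent layout.
        idx = np.sort(top) + lo
        feats[t] = spec[idx].ravel()
    return feats
```

Calling `transition_features(spec)` on a `(T, F)` array of spectral frames returns one fixed-length vector per frame, which could then be fed to a frame-based classifier in place of features sampled at fixed offsets.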
Bibliographic reference. Hu, Zhihong / Barnard, Etienne / Cole, Ronald A. (1995): "Transition-based feature extraction within frame-based recognition", In EUROSPEECH-1995, 1555-1558.