Interspeech'2005 - Eurospeech
Fixed-scale (typically 25 ms) short-time spectral analysis of speech signals, which are inherently multi-scale in nature (vowels typically last 40-80 ms while stops last 10-20 ms), is clearly sub-optimal for time-frequency resolution. Based on the usual assumption that the speech signal can be modeled by a time-varying autoregressive (AR) Gaussian process, we estimate the largest piecewise quasi-stationary speech segments, using the likelihood that a segment was generated by a single AR process. This likelihood is estimated from the Linear Prediction (LP) residual error. Each of these quasi-stationary segments is then used as an analysis window from which spectral features are extracted. The approach thus results in a variable-scale short-time spectral analysis that adaptively estimates the largest analysis window over which the signal remains quasi-stationary, and hence the best trade-off between temporal and frequency resolution. Speech recognition experiments on the OGI Numbers95 database show that features based on the proposed variable-scale piecewise stationary spectral analysis yield improved recognition accuracy in clean conditions, compared to features based on the minimum cross entropy spectrum as well as those based on fixed-scale spectral analysis.
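As an illustration only, the segmentation idea described in the abstract might be sketched in Python as follows. This is not the authors' implementation. The least-squares AR fit, the generalized likelihood ratio (GLR) form of the test, the greedy window growth, and every name and parameter value (`order`, `min_len`, `step`, `max_len`, `threshold`) are assumptions made for the sketch. The sample counts assume an 8 kHz sampling rate, which is also an assumption.

```python
import numpy as np


def lp_residual_energy(x, order=8):
    """Mean squared LP residual of a least-squares AR(order) fit to x."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Each row holds the past samples [x[t-1], ..., x[t-order]] for t = order..n-1.
    past = np.column_stack([x[order - k - 1:n - k - 1] for k in range(order)])
    target = x[order:]
    coeffs, *_ = np.linalg.lstsq(past, target, rcond=None)
    resid = target - past @ coeffs
    # Floor the energy so the logarithm below stays finite (assumed safeguard).
    return max(float(np.mean(resid ** 2)), 1e-12)


def glr_statistic(left, right, order=8):
    """Approximate Gaussian log-likelihood ratio: two AR models vs. one.

    Large values suggest that left and right came from different AR processes.
    """
    n1, n2 = len(left), len(right)
    pooled = np.concatenate([left, right])
    return 0.5 * ((n1 + n2) * np.log(lp_residual_energy(pooled, order))
                  - n1 * np.log(lp_residual_energy(left, order))
                  - n2 * np.log(lp_residual_energy(right, order)))


def segment(signal, min_len=160, step=80, max_len=800, threshold=20.0, order=8):
    """Greedily grow each window until the next `step` samples look non-stationary.

    Returns a list of boundary sample indices, starting at 0 and ending at len(signal).
    """
    signal = np.asarray(signal, dtype=float)
    n = len(signal)
    boundaries = [0]
    start = 0
    while start + min_len + step <= n:
        end = start + min_len
        while end + step <= n and end - start + step <= max_len:
            stat = glr_statistic(signal[start:end], signal[end:end + step], order)
            if stat > threshold:
                break  # the incoming samples no longer fit the current AR model
            end += step
        boundaries.append(end)
        start = end
    if boundaries[-1] != n:
        boundaries.append(n)  # the leftover tail becomes the final segment
    return boundaries
```

Each pair of consecutive boundaries returned by `segment` would then delimit one variable-length analysis window for spectral feature extraction. The threshold controls how readily a window is closed, and it would need tuning on real speech.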
Bibliographic reference. Tyagi, Vivek / Wellekens, Christian / Bourlard, Hervé (2005): "On variable-scale piecewise stationary spectral analysis of speech signals for ASR", In INTERSPEECH-2005, 209-212.