This paper describes investigations into the use of ‘excitation-synchronous’ spectral analysis to provide acoustic features for automatic speech recognition. Within each 10 ms frame the region of maximum power is located and used as the centre for the window in a subsequent Fourier transform. The method has been found to be effective in locating stop bursts and vocal-tract responses to glottal closures. This excitation-synchronous analysis has been compared with the more conventional fixed-interval analysis for window lengths ranging from 5 to 25 ms. In connected-digit recognition experiments using mel-cepstrum features, the excitation-synchronous analysis with a window length of 10 ms gave a 10% improvement in recognition performance when compared with the best of the fixed-window conditions.
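The window placement described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not say how the region of maximum power is measured, so the triangular power weighting, the `smooth_ms` span, the Hamming window and the function names `find_window_centres` and `windowed_spectrum` are all assumptions made for illustration. Only the 10 ms frame and the idea of centring the analysis window on the highest-power region come from the text.

```python
import cmath
import math


def find_window_centres(signal, sample_rate, frame_ms=10.0, smooth_ms=1.0):
    """Return one sample index per frame: the centre of its highest-power region.

    Assumption: local power is a triangularly weighted sum of squared samples
    over roughly `smooth_ms` milliseconds. The paper may use another measure.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    if frame_len < 1:
        raise ValueError("frame length must be at least one sample")
    half = max(1, int(round(sample_rate * smooth_ms / 1000.0)) // 2)
    n = len(signal)
    power = [x * x for x in signal]

    def local_power(i):
        # Triangular weights peak at offset 0, so an isolated pulse
        # produces a unique maximum at its own position.
        total = 0.0
        for d in range(-half, half + 1):
            j = i + d
            if 0 <= j < n:
                total += (half + 1 - abs(d)) * power[j]
        return total

    centres = []
    # Only complete frames are analysed; ties resolve to the earliest sample.
    for start in range(0, n - frame_len + 1, frame_len):
        centres.append(max(range(start, start + frame_len), key=local_power))
    return centres


def windowed_spectrum(signal, centre, window_len):
    """Magnitude spectrum of a Hamming-windowed segment centred on `centre`.

    Samples outside the signal are treated as zero. A naive DFT keeps the
    sketch dependency-free; an FFT library would be used in practice.
    """
    if window_len < 2:
        raise ValueError("window length must be at least two samples")
    start = centre - window_len // 2
    segment = []
    for k in range(window_len):
        i = start + k
        x = signal[i] if 0 <= i < len(signal) else 0.0
        w = 0.54 - 0.46 * math.cos(2.0 * math.pi * k / (window_len - 1))
        segment.append(x * w)
    magnitudes = []
    for b in range(window_len // 2 + 1):
        acc = sum(
            segment[k] * cmath.exp(-2j * math.pi * b * k / window_len)
            for k in range(window_len)
        )
        magnitudes.append(abs(acc))
    return magnitudes
```

Calling `find_window_centres(signal, 8000)` on an 8 kHz signal yields one centre per 10 ms frame, and each centre can then be passed to `windowed_spectrum` with a window length between 40 and 200 samples to cover the 5 to 25 ms range compared in the paper. A fixed-interval analysis corresponds to replacing the returned centres with evenly spaced indices.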
Cite as: Holmes, W.J. (2000) Improving the representation of time structure in front-ends for automatic speech recognition. Proc. 6th International Conference on Spoken Language Processing (ICSLP 2000), vol. 2, 1073-1076, doi: 10.21437/ICSLP.2000-459
@inproceedings{holmes00_icslp,
  author={Wendy J. Holmes},
  title={{Improving the representation of time structure in front-ends for automatic speech recognition}},
  year=2000,
  booktitle={Proc. 6th International Conference on Spoken Language Processing (ICSLP 2000)},
  pages={vol. 2, 1073-1076},
  doi={10.21437/ICSLP.2000-459}
}