When a speech signal is shifted by a few samples, we observe significant variations in the feature vectors produced by the acoustic front-end. Furthermore, when these shifted utterances are decoded with a continuous speech recognition system, they lead to dramatically different word error rates. This paper analyzes the phenomenon and illustrates the well-known result that classical acoustic front-end processors, including spectrum- and cepstrum-based techniques, are sensitive to time shifts. After describing the effect of sample-sized shifts on the spectral estimates of the signal, we propose several techniques that take advantage of shift variations to multiply the amount of training data that speech utterances can provide. Finally, we illustrate how the acoustic front-end can be slightly modified to render the recognizer invariant to small shifts.
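The shift sensitivity described above can be illustrated with a minimal sketch. It is not the front-end used in the paper: the synthetic signal, the frame length, the hop size, the Hamming window, and the helper name `frame_log_spectra` are all assumptions chosen for illustration. The sketch computes frame-based log power spectra for a signal and for copies of it shifted by a few samples, then measures how much the features change.

```python
import numpy as np

def frame_log_spectra(signal, frame_len=256, hop=100):
    """Log power spectrum per frame (hypothetical stand-in for a front-end)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + 1e-10)  # small floor avoids log(0)

# Synthetic non-stationary test signal (an assumption, not real speech).
rng = np.random.default_rng(0)
t = np.arange(8000)
signal = (
    np.sin(2 * np.pi * 0.01 * t * (1 + t / 8000))
    * (1 + 0.5 * np.sin(2 * np.pi * t / 900))
    + 0.05 * rng.standard_normal(t.size)
)

max_shift = 4
n = len(signal) - max_shift  # common length for every shifted copy

# One feature sequence per shift of 0..max_shift samples.
shifted_features = [
    frame_log_spectra(signal[s : s + n]) for s in range(max_shift + 1)
]

base = shifted_features[0]
for s, feats in enumerate(shifted_features):
    diff = np.mean(np.abs(feats - base))
    print(f"shift={s} samples, mean |delta log-spectrum| = {diff:.4f}")
```

Under these assumptions, the list `shifted_features` also hints at how shift variation could multiply training material: a single utterance yields `max_shift + 1` distinct feature sequences. Whether this matches any of the specific techniques proposed in the paper is not stated in the abstract.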
Cite as: Basu, S., Ittycheriah, A., Maes, S. (1998) Time shift invariant speech recognition. Proc. 5th International Conference on Spoken Language Processing (ICSLP 1998), paper 0983, doi: 10.21437/ICSLP.1998-656
@inproceedings{basu98b_icslp,
  author={Sankar Basu and Abraham Ittycheriah and Stéphane Maes},
  title={{Time shift invariant speech recognition}},
  year=1998,
  booktitle={Proc. 5th International Conference on Spoken Language Processing (ICSLP 1998)},
  pages={paper 0983},
  doi={10.21437/ICSLP.1998-656}
}