We introduce a novel approach to decoding in speech recognition (termed attention-shift decoding) that attempts to mimic aspects of human speech recognition responsible for robustness in processing conversational speech. Our approach is a radical departure from traditional decoding algorithms for speech recognition. We propose a method to first identify reliable regions of the speech signal and then use these to help decode the unreliable regions, thus conditioning on potentially non-consecutive portions of the signal. We test this approach in a second-pass rescoring framework and compare it to standard second-pass rescoring. On a conversational telephone speech recognition task (EARS RT-03 CTS evaluation), our approach shows an improvement of 2.6% absolute when using oracle information for detecting the reliable regions, and 0.4% absolute when detecting the reliable regions automatically.
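The two-stage idea described above (fix the reliable regions first, then decode the unreliable regions conditioned on them) can be illustrated with a toy sketch. This is not the authors' system. It assumes a simplified lattice of word slots with per-slot posteriors, a fixed confidence threshold, and a bigram language model. The names `attention_shift_rescore`, `bigram_logprob`, `threshold`, and `lm_weight` are hypothetical and chosen only for illustration.

```python
import math

def attention_shift_rescore(slots, bigram_logprob, threshold=0.8, lm_weight=1.0):
    """Toy two-stage decoding over word slots (illustrative assumption only).

    slots: list of dicts mapping candidate word -> posterior (must be > 0).
    bigram_logprob: function (previous_word, word) -> log probability.
    """
    decoded = [None] * len(slots)
    confidences = [max(slot.values()) for slot in slots]

    # Stage 1: commit to the top candidate in every reliable slot.
    for i, slot in enumerate(slots):
        if confidences[i] >= threshold:
            decoded[i] = max(slot, key=slot.get)

    # Stage 2: visit the unreliable slots, most confident first, and score
    # each candidate against whichever neighbours are already decoded.
    pending = sorted(
        (i for i in range(len(slots)) if decoded[i] is None),
        key=lambda i: -confidences[i],
    )
    for i in pending:
        left = decoded[i - 1] if i > 0 else None
        right = decoded[i + 1] if i + 1 < len(slots) else None

        def score(word):
            total = math.log(slots[i][word])
            if left is not None:
                total += lm_weight * bigram_logprob(left, word)
            if right is not None:
                total += lm_weight * bigram_logprob(word, right)
            return total

        decoded[i] = max(slots[i], key=score)
    return decoded


# Hypothetical bigram table; unseen pairs get a small floor probability.
BIGRAMS = {
    ("the", "weather"): 0.2,
    ("the", "whether"): 0.001,
    ("weather", "is"): 0.3,
    ("whether", "is"): 0.05,
}

def bigram_logprob(previous_word, word):
    return math.log(BIGRAMS.get((previous_word, word), 1e-4))


slots = [
    {"the": 0.9, "a": 0.1},               # reliable
    {"whether": 0.55, "weather": 0.45},   # unreliable
    {"is": 0.95, "as": 0.05},             # reliable
]
result = attention_shift_rescore(slots, bigram_logprob)
print(result)  # ['the', 'weather', 'is']
```

In this example the middle slot is decoded last, using both its left and right neighbours as context, so the lower-posterior candidate "weather" wins over "whether". A strictly left-to-right pass would not have the right-hand context available at that point.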
Cite as: Kumaran, R., Bilmes, J., Kirchhoff, K. (2007) Attention shift decoding for conversational speech recognition. Proc. Interspeech 2007, 1493-1496, doi: 10.21437/Interspeech.2007-432
@inproceedings{kumaran07_interspeech,
  author={Raghunandan Kumaran and Jeff Bilmes and Katrin Kirchhoff},
  title={{Attention shift decoding for conversational speech recognition}},
  year=2007,
  booktitle={Proc. Interspeech 2007},
  pages={1493--1496},
  doi={10.21437/Interspeech.2007-432}
}