Visual speech, i.e., video recordings of speakers' mouths, plays an important role in improving the noise robustness of automatic speech recognition (ASR). Optimally fusing the audio and video modalities remains one of the major open challenges in audio-visual ASR. Recently, turbo decoders (TDs) have been successfully applied to the audio-visual fusion problem. The idea of the TD framework is to iteratively exchange soft information between the audio and video decoders until convergence. This soft information is typically estimated by applying the forward-backward algorithm (FBA) to the decoding graphs. Applying the FBA to the complex decoding graphs used in large-vocabulary tasks can, however, be computationally expensive. In this paper, I propose to apply the forward-backward algorithm to a lattice of the most likely state sequences rather than to the entire decoding graph. Using lattices allows the TD to be easily applied to large-vocabulary tasks. The proposed approach is evaluated on the recently released TCD-TIMIT corpus using a standard large-vocabulary ASR recipe. The modified TD performs significantly better than the feature and decision fusion models in all clean and noisy test conditions.
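As context for the decoding machinery the abstract refers to, the following is a minimal Python sketch of the forward-backward algorithm computing per-frame state posteriors, the kind of soft information a turbo decoder exchanges between its audio and video branches. Everything here, the function names, the dense toy trellis standing in for the paper's pruned lattice, and the random model, is an illustrative assumption rather than the paper's implementation.

import numpy as np
from scipy.special import logsumexp

def forward_backward(log_trans, log_obs, log_init):
    """Per-frame state posteriors gamma[t, s] = p(s_t = s | o_1..o_T).

    log_trans: (S, S) log transition probabilities, log_trans[i, j] = log p(j | i)
    log_obs:   (T, S) log observation likelihoods per frame and state
    log_init:  (S,)   log initial state probabilities
    """
    T, S = log_obs.shape
    log_alpha = np.full((T, S), -np.inf)
    log_beta = np.full((T, S), -np.inf)

    # Forward pass: alpha[t, j] = p(o_1..o_t, s_t = j)
    log_alpha[0] = log_init + log_obs[0]
    for t in range(1, T):
        for j in range(S):
            log_alpha[t, j] = logsumexp(log_alpha[t - 1] + log_trans[:, j]) + log_obs[t, j]

    # Backward pass: beta[t, i] = p(o_{t+1}..o_T | s_t = i)
    log_beta[T - 1] = 0.0
    for t in range(T - 2, -1, -1):
        for i in range(S):
            log_beta[t, i] = logsumexp(log_trans[i] + log_obs[t + 1] + log_beta[t + 1])

    # Combine and normalize per frame to obtain posteriors.
    log_gamma = log_alpha + log_beta
    log_gamma -= logsumexp(log_gamma, axis=1, keepdims=True)
    return np.exp(log_gamma)

if __name__ == "__main__":
    # Toy 3-state model over 5 frames; all quantities are random placeholders.
    rng = np.random.default_rng(0)
    S, T = 3, 5
    trans = rng.dirichlet(np.ones(S), size=S)   # rows sum to 1
    init = np.full(S, 1.0 / S)
    obs = rng.dirichlet(np.ones(S), size=T)     # stand-in frame likelihoods
    gamma = forward_backward(np.log(trans), np.log(obs), np.log(init))
    print(gamma)  # each row sums to 1: per-frame state posteriors

In the paper's setting, the same recursions would run only over the arcs retained in a lattice of likely state sequences rather than over a full dense trellis or decoding graph, which is what keeps the cost manageable for large-vocabulary tasks.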
Cite as: Abdelaziz, A.H. (2017) Turbo Decoders for Audio-Visual Continuous Speech Recognition. Proc. Interspeech 2017, 3667-3671, doi: 10.21437/Interspeech.2017-799
@inproceedings{abdelaziz17_interspeech,
  author={Ahmed Hussen Abdelaziz},
  title={{Turbo Decoders for Audio-Visual Continuous Speech Recognition}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={3667--3671},
  doi={10.21437/Interspeech.2017-799}
}