ISCA Archive Interspeech 2016

Dynamic Stream Weighting for Turbo-Decoding-Based Audiovisual ASR

Sebastian Gergen, Steffen Zeiler, Ahmed Hussen Abdelaziz, Robert Nickel, Dorothea Kolossa

Automatic speech recognition (ASR) enables very intuitive human-machine interaction. However, signal degradations due to reverberation or noise reduce the accuracy of audio-based recognition. The introduction of a second signal stream that is not affected by degradations in the audio domain (e.g., a video stream) increases the robustness of ASR against degradations in the original domain. Here, depending on the signal quality of audio and video at each point in time, a dynamic weighting of both streams can optimize the recognition performance. In this work, we introduce a strategy for estimating optimal weights for the audio and video streams in turbo-decoding-based ASR using a discriminative cost function. The results show that turbo decoding with this maximally discriminative dynamic weighting of information yields higher recognition accuracy than turbo-decoding-based recognition with fixed stream weights or optimally dynamically weighted audiovisual decoding using coupled hidden Markov models.
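
The idea of weighting two streams per time frame can be illustrated with a minimal sketch. It assumes the common log-linear combination, in which a weight between 0 and 1 scales the audio log-likelihood and its complement scales the video log-likelihood. The abstract does not state the paper's fusion rule, and this sketch covers neither the turbo-decoding scheme nor the discriminative weight estimator. The function name `fuse_log_likelihoods` and all numbers are hypothetical.

```python
def fuse_log_likelihoods(audio_ll, video_ll, weights):
    """Fuse per-frame, per-state log-likelihoods of two streams.

    audio_ll, video_ll: lists of frames, each a list of state log-likelihoods.
    weights: one audio weight per frame, each in [0, 1] (assumed convention);
             the video stream receives the complementary weight 1 - weight.
    """
    if not (len(audio_ll) == len(video_ll) == len(weights)):
        raise ValueError("all inputs must cover the same number of frames")
    fused = []
    for a_frame, v_frame, lam in zip(audio_ll, video_ll, weights):
        if not 0.0 <= lam <= 1.0:
            raise ValueError("stream weight must lie in [0, 1]")
        if len(a_frame) != len(v_frame):
            raise ValueError("both streams must score the same states")
        # Log-linear combination, applied state by state.
        fused.append([lam * a + (1.0 - lam) * v for a, v in zip(a_frame, v_frame)])
    return fused


# Hypothetical scores: two frames, two states.
audio = [[-1.0, -3.0], [-4.0, -4.2]]   # second frame: audio is uninformative
video = [[-2.0, -2.5], [-1.0, -5.0]]
weights = [0.9, 0.2]                   # trust audio first, then video
fused = fuse_log_likelihoods(audio, video, weights)
```

With a fixed weight, every frame would use the same value; a dynamic scheme lets the weight follow the momentary reliability of each stream, which is what the second frame in the example mimics.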

doi: 10.21437/Interspeech.2016-166

Cite as: Gergen, S., Zeiler, S., Abdelaziz, A.H., Nickel, R., Kolossa, D. (2016) Dynamic Stream Weighting for Turbo-Decoding-Based Audiovisual ASR. Proc. Interspeech 2016, 2135-2139, doi: 10.21437/Interspeech.2016-166

  author={Sebastian Gergen and Steffen Zeiler and Ahmed Hussen Abdelaziz and Robert Nickel and Dorothea Kolossa},
  title={{Dynamic Stream Weighting for Turbo-Decoding-Based Audiovisual ASR}},
  booktitle={Proc. Interspeech 2016},