Introducing the Turbo-Twin-HMM for Audio-Visual Speech Enhancement

Steffen Zeiler, Hendrik Meutzner, Ahmed Hussen Abdelaziz, Dorothea Kolossa

Models for automatic speech recognition (ASR) hold detailed information about spectral and spectro-temporal characteristics of clean speech signals. Using these models for speech enhancement is desirable and has been the target of past research efforts. In such model-based speech enhancement systems, a powerful ASR system is imperative. To increase recognition rates, especially in low-SNR conditions, we suggest using the additional visual modality, which is mostly unaffected by degradations in the acoustic channel. An optimal integration of acoustic and visual information is achievable by joint inference in both modalities within the turbo-decoding framework. By combining turbo-decoding with Twin-HMMs for speech enhancement in this way, notable improvements can be achieved, not only in terms of instrumental estimates of speech quality but also in actual speech intelligibility. This is verified through listening tests, which show that, in highly challenging noise conditions, average human recognition accuracy can be improved from 64% without signal processing to 80% when using the presented architecture.

DOI: 10.21437/Interspeech.2016-350

Cite as

Zeiler, S., Meutzner, H., Abdelaziz, A.H., Kolossa, D. (2016) Introducing the Turbo-Twin-HMM for Audio-Visual Speech Enhancement. Proc. Interspeech 2016, 1750-1754.

author={Steffen Zeiler and Hendrik Meutzner and Ahmed Hussen Abdelaziz and Dorothea Kolossa},
title={Introducing the Turbo-Twin-HMM for Audio-Visual Speech Enhancement},
booktitle={Interspeech 2016},