Jointly using audio and video features can increase the robustness of automatic speech recognition systems in noisy environments. A systematic and reliable performance gain, however, is only achieved if the contributions of the audio and video stream to the decoding decision are dynamically optimized, for example via so-called stream weights. In this paper, we address the problem of dynamic stream weight estimation for coupled-HMM-based audio-visual speech recognition. We investigate the multilayer perceptron (MLP) for mapping reliability measure features to stream weights. As an input for the multilayer perceptron, we use a feature vector containing different model-based and signal-based reliability measures. Training of the multilayer perceptron has been achieved using dynamic oracle stream weights as target outputs, which are found using a recently proposed expectation maximization algorithm. This new approach of MLP-based stream-weight estimation has been evaluated using the Grid audio-visual corpus and has outperformed the best baseline performance, yielding a 23.72% average relative error rate reduction.
Bibliographic reference. Abdelaziz, Ahmed Hussen / Kolossa, Dorothea (2014): "Dynamic stream weight estimation in coupled-HMM-based audio-visual speech recognition using multilayer perceptrons", In INTERSPEECH-2014, 1144-1148.