Recent interest in “deep learning”, which can be defined as the use of algorithms to model high-level abstractions in data, using models composed of multiple non-linear transformations, has resulted in an increase in the number of studies investigating the use of deep learning with automatic speech recognition (ASR) systems. Some of these studies have found that bottleneck features extracted from deep neural networks (DNNs), sometimes called “deep bottleneck features” (DBNFs), can reduce the word error rates of ASR systems. However, there has been little research on audio-visual speech recognition (AVSR) systems using DBNFs. In this paper, we propose a method of integrating DBNFs using multi-stream HMMs in order to improve the performance of AVSRs under both clean and noisy conditions. We evaluate our method using a continuously spoken, Japanese digit recognition task under matched and mismatched conditions. Relative word error reduction rates of roughly 68.7%, 47.4%, and 51.9% were achieved, compared with an audio-only ASR system and two feature-fusion models, which employed DBNFs and single-stream HMMs, respectively.
Bibliographic reference. Ninomiya, Hiroshi / Kitaoka, Norihide / Tamura, Satoshi / Iribe, Yurie / Takeda, Kazuya (2015): "Integration of deep bottleneck features for audio-visual speech recognition", In INTERSPEECH-2015, 563-567.