Multimodal SpeakerBeam: Single Channel Target Speech Extraction with Audio-Visual Speaker Clues

Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Atsunori Ogawa, Tomohiro Nakatani


Recently, with the advent of deep learning, there has been significant progress in the processing of speech mixtures. In particular, the use of neural networks has enabled target speech extraction, which extracts the speech signal of a target speaker from a speech mixture by utilizing an auxiliary clue representing the characteristics of the target speaker. For example, audio clues derived from an auxiliary utterance spoken by the target speaker have been used to characterize the target speaker. Audio clues should capture the fine-grained characteristics of the target speaker’s voice (e.g., pitch). Alternatively, visual clues derived from a video of the target speaker’s face speaking in the mixture have also been investigated. Visual clues should mainly capture the phonetic information derived from lip movements. In this paper, we propose a novel target speech extraction scheme that combines audio and visual clues about the target speaker to take advantage of the information provided by both modalities. We introduce an attention mechanism that emphasizes the most informative speaker clue at every time frame. Experiments on mixtures of two speakers demonstrated that our proposed method using audio-visual speaker clues significantly improved the extraction performance compared with conventional methods using either audio or visual speaker clues alone.
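The per-frame attention fusion described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; the projection matrix `W`, the use of mixture features as the attention query, and all shapes are assumptions chosen for illustration. The idea shown is the general technique: a time-invariant audio clue embedding is tiled across frames, stacked with the per-frame visual clue embedding, and a softmax over the two modalities selects the more informative clue at each frame.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_clues(audio_clue, visual_clue, mix_feats, W):
    """Illustrative per-frame attention over audio/visual speaker clues.

    audio_clue:  (D,)   time-invariant embedding from an enrollment utterance
    visual_clue: (T, D) per-frame embedding from the target speaker's lip video
    mix_feats:   (T, D) mixture features, used here as the attention query
    W:           (D, D) hypothetical learned projection for scoring
    Returns the fused clue sequence (T, D) and attention weights (T, 2).
    """
    T, D = mix_feats.shape
    audio = np.broadcast_to(audio_clue, (T, D))       # tile audio clue over time
    clues = np.stack([audio, visual_clue], axis=1)    # (T, 2, D)
    query = mix_feats @ W                             # (T, D)
    scores = np.einsum('td,tmd->tm', query, clues)    # (T, 2) dot-product scores
    weights = softmax(scores, axis=1)                 # attention over modalities
    fused = np.einsum('tm,tmd->td', weights, clues)   # per-frame weighted sum
    return fused, weights

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, D = 5, 8
    fused, weights = fuse_clues(rng.normal(size=D),
                                rng.normal(size=(T, D)),
                                rng.normal(size=(T, D)),
                                rng.normal(size=(D, D)))
    print(fused.shape, weights.shape)  # (5, 8) (5, 2)
```

In the actual system, the fused clue sequence would condition the extraction network; the two-way softmax lets the model lean on the audio clue when the lips are occluded and on the visual clue when voices are similar.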


DOI: 10.21437/Interspeech.2019-1513

Cite as: Ochiai, T., Delcroix, M., Kinoshita, K., Ogawa, A., Nakatani, T. (2019) Multimodal SpeakerBeam: Single Channel Target Speech Extraction with Audio-Visual Speaker Clues. Proc. Interspeech 2019, 2718-2722, DOI: 10.21437/Interspeech.2019-1513.


@inproceedings{Ochiai2019,
  author={Tsubasa Ochiai and Marc Delcroix and Keisuke Kinoshita and Atsunori Ogawa and Tomohiro Nakatani},
  title={{Multimodal SpeakerBeam: Single Channel Target Speech Extraction with Audio-Visual Speaker Clues}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={2718--2722},
  doi={10.21437/Interspeech.2019-1513},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1513}
}