EUROSPEECH 2003 - INTERSPEECH 2003
This paper proposes a method for detecting and separating speech events in a multiple-sound-source condition using audio and video information. To detect speech events, sound localization using a microphone array and human tracking by stereo vision are combined by a Bayesian network. From the inference results of the Bayesian network, the time and location of speech events can be obtained even when multiple sound sources are present. Based on the detected speech event information, a maximum likelihood adaptive beamformer is constructed, and the speech signal is separated from the background noise and interference.
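The fusion step can be illustrated with a minimal sketch. The abstract does not give the structure of the Bayesian network, so the example below assumes the simplest possible case: one binary "speech event" node per candidate location and time frame, with an audio observation (sound localized at that location) and a video observation (person tracked at that location) that are conditionally independent given the speech state. The function name, parameter names, and all probability values are hypothetical and chosen only for illustration.

```python
def speech_event_posterior(prior,
                           p_audio_given_speech, p_audio_given_no_speech,
                           p_video_given_speech, p_video_given_no_speech):
    """Return P(speech | audio, video) for one location and time frame.

    Assumption (not from the paper): the audio and video observations are
    conditionally independent given the speech state.
    """
    joint_speech = prior * p_audio_given_speech * p_video_given_speech
    joint_no_speech = (1.0 - prior) * p_audio_given_no_speech * p_video_given_no_speech
    total = joint_speech + joint_no_speech
    if total == 0.0:
        # Both hypotheses rule out the observations; fall back to the prior.
        return prior
    return joint_speech / total


# Hypothetical numbers: a sound is localized at this position and a person
# is tracked there by the cameras.
fused = speech_event_posterior(
    prior=0.2,
    p_audio_given_speech=0.9, p_audio_given_no_speech=0.3,
    p_video_given_speech=0.95, p_video_given_no_speech=0.4,
)

# Audio evidence alone: setting both video likelihoods to 1.0 makes the
# video observation uninformative.
audio_only = speech_event_posterior(
    prior=0.2,
    p_audio_given_speech=0.9, p_audio_given_no_speech=0.3,
    p_video_given_speech=1.0, p_video_given_no_speech=1.0,
)

print(f"audio only: {audio_only:.3f}, audio + video: {fused:.3f}")
```

With these illustrative values, adding the video observation raises the posterior above the audio-only estimate. Frames and locations whose posterior exceeds a chosen threshold could then be treated as detected speech events when estimating the beamformer, although the abstract does not state the decision rule used in the paper.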
Bibliographic reference. Asano, Futoshi / Motomura, Yoichi / Asoh, Hideki / Yoshimura, Takashi / Ichimura, Naoyuki / Yamamoto, Kiyoshi / Kitawaki, Nobuhiko / Nakamura, Satoshi (2003): "Detection and separation of speech segment using audio and video information fusion", In EUROSPEECH-2003, 2257-2260.