EUROSPEECH 2003 - INTERSPEECH 2003
8th European Conference on Speech Communication and Technology

Geneva, Switzerland
September 1-4, 2003

Detection and Separation of Speech Segment Using Audio and Video Information Fusion

Futoshi Asano (1), Yoichi Motomura (1), Hideki Asoh (1), Takashi Yoshimura (1), Naoyuki Ichimura (1), Kiyoshi Yamamoto (2), Nobuhiko Kitawaki (2), Satoshi Nakamura (3)

(1) AIST, Japan
(2) Tsukuba University, Japan
(3) ATR-SLT, Japan

In this paper, a method of detecting and separating speech events under multiple-sound-source conditions using fused audio and video information is proposed. For detecting speech events, sound localization using a microphone array and human tracking by stereo vision are combined by a Bayesian network. From the inference results of the Bayesian network, the time and location of speech events can be obtained even when multiple sound sources are active. Based on the detected speech-event information, a maximum likelihood adaptive beamformer is constructed, and the speech signal is separated from the background noise and interference.
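The two stages of the abstract can be sketched in code. This is a minimal illustration, not the paper's actual system: the Bayesian network is reduced to a naive-Bayes fusion of audio and video likelihoods, and the maximum likelihood adaptive beamformer is stood in for by an MVDR (minimum-variance distortionless-response) beamformer, which is the ML solution under Gaussian noise. All function names and probability values are hypothetical.

```python
import numpy as np

# Stage 1: fuse audio and video evidence for a speech event at a location.
# Naive-Bayes simplification of the paper's Bayesian network:
#   P(speech | audio, video) ∝ P(audio | speech) P(video | speech) P(speech)
def fuse_evidence(p_audio_speech, p_audio_noise,
                  p_video_speech, p_video_noise,
                  prior_speech=0.5):
    """Posterior probability that a speech event is present."""
    num = p_audio_speech * p_video_speech * prior_speech
    den = num + p_audio_noise * p_video_noise * (1.0 - prior_speech)
    return num / den

# Stage 2: once a speech event and its location are detected, steer a
# beamformer toward it. MVDR weights: w = R^{-1} a / (a^H R^{-1} a),
# where R is the noise covariance and a the steering vector toward the
# detected location (distortionless toward the target, noise minimized).
def mvdr_weights(R, steering):
    Rinv = np.linalg.inv(R)
    num = Rinv @ steering
    return num / (steering.conj() @ num)

# Example: strong audio and video evidence yields a high posterior.
p = fuse_evidence(0.9, 0.1, 0.8, 0.2)

# Example: 2-microphone array, white noise covariance.
R = np.eye(2, dtype=complex)
a = np.array([1.0, 1.0], dtype=complex) / np.sqrt(2)
w = mvdr_weights(R, a)  # satisfies the constraint w^H a = 1
```

The distortionless constraint `w.conj() @ a == 1` guarantees that the signal from the detected speech location passes unchanged while the output noise power is minimized.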

Bibliographic reference.  Asano, Futoshi / Motomura, Yoichi / Asoh, Hideki / Yoshimura, Takashi / Ichimura, Naoyuki / Yamamoto, Kiyoshi / Kitawaki, Nobuhiko / Nakamura, Satoshi (2003): "Detection and separation of speech segment using audio and video information fusion", In EUROSPEECH-2003, 2257-2260.