International Workshop on Hands-Free Speech Communication (HSC2001), April 9-11, 2001
In this paper, we propose a method for detecting the end points of speaking sections (EPD: End Point Detection) using visual information. It is well known that the accuracy of EPD affects speech recognition accuracy. Detecting speech end points from a noisy audio signal is difficult because the speech is masked by the audio noise. We therefore propose an EPD method that uses video images of the speaker's facial motion, which are not affected by audio noise. Our method locates the skin area using color information and estimates the area that includes the speech organs. The end points are then detected from the speed and magnitude of the image change. The experiment confirms that the proposed method is robust with respect to visual noise: its detection rate is 100% both with and without visual noise, whereas audio-based EPD achieves 99.2% at an SNR of 46 dB and degrades to 30.1% at an SNR of 10 dB.
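The final step described above, detecting end points from the magnitude of image change, can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the speech-organ region has already been cropped into small grayscale frames (nested lists of pixel values), and the function names `frame_change` and `detect_end_points`, the `threshold` value, and the `hangover` length are all hypothetical choices made for illustration.

```python
def frame_change(prev, curr):
    """Mean absolute pixel difference between two equally sized grayscale frames."""
    total = 0
    count = 0
    for row_p, row_c in zip(prev, curr):
        for p, c in zip(row_p, row_c):
            total += abs(c - p)
            count += 1
    return total / count if count else 0.0


def detect_end_points(frames, threshold=5.0, hangover=2):
    """Return (start, end) frame-index pairs for sections with sustained motion.

    threshold and hangover are illustrative assumptions, not values from the paper.
    A section ends once more than `hangover` consecutive frames show no motion.
    """
    sections = []
    start = None   # index of the first moving frame in the current section
    quiet = 0      # consecutive non-moving frames seen inside a section
    for i in range(1, len(frames)):
        moving = frame_change(frames[i - 1], frames[i]) > threshold
        if moving:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet > hangover:
                # The last moving frame is `quiet` frames behind the current one.
                sections.append((start, i - quiet))
                start = None
                quiet = 0
    if start is not None:
        # The sequence ended while a section was still open.
        sections.append((start, len(frames) - 1 - quiet))
    return sections
```

The hangover counter keeps brief pauses in mouth movement from splitting one utterance into several sections; a practical system would tune both parameters on recorded data and would also need the skin-color localization step, which this sketch leaves out.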
Bibliographic reference. Murai, Kazumasa / Kumatani, Kennichi / Nakamura, Satoshi (2001): "A robust end point detection by speaker's facial motion", In HSC2001, 199-202.