This paper describes an attempt to prevent external acoustic noise from being misrecognized as speech. In speech activity detection, the preprocessing phase of speech recognition, the speaker's lip movement is confirmed from an image signal in addition to analyzing the acoustic energy. A PC camera is added to an existing speech recognition environment, and the captured images are analyzed to detect lip movement and to classify whether the acoustic signal is speech produced by a human. Whether to continue the recognition process is decided from the image-signal confirmation result stored in shared memory.
We combined a speech recognition processor with an image recognizer, and the interworking function operated successfully at a rate of 99.3%. When a subject spoke while facing the camera, processing proceeded normally to the output of the speech recognition result. When the subject did not face the camera, however, no speech recognition result was produced, because acoustic energy is treated as noise when no lip movement is confirmed.
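The gating decision described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the mean-square energy measure, and the threshold value are all assumptions made for illustration, and the lip-movement flag is passed in directly rather than read from shared memory.

```python
from typing import Sequence

# Hypothetical threshold; the paper does not state a value.
ENERGY_THRESHOLD = 0.01


def frame_energy(samples: Sequence[float]) -> float:
    """Mean-square energy of one audio frame (assumed energy measure)."""
    if not samples:
        return 0.0
    return sum(s * s for s in samples) / len(samples)


def accept_as_speech(
    samples: Sequence[float],
    lip_moving: bool,
    threshold: float = ENERGY_THRESHOLD,
) -> bool:
    """Pass a frame on to recognition only if it is loud enough AND
    lip movement was confirmed by the image recognizer.

    `lip_moving` stands in for the confirmation result that the paper
    stores in shared memory.
    """
    has_energy = frame_energy(samples) >= threshold
    return has_energy and lip_moving


if __name__ == "__main__":
    loud_frame = [0.5, -0.5, 0.5, -0.5]
    print(accept_as_speech(loud_frame, lip_moving=True))
    print(accept_as_speech(loud_frame, lip_moving=False))
```

In this sketch a loud frame without confirmed lip movement is rejected, which mirrors the behavior reported for a subject who is not facing the camera.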
Bibliographic reference. Lee, Soo-jong / Park, Jun / Kim, Eung-kyeu (2007): "Preventing an external acoustic noise from being misrecognized as a speech recognition object by confirming the lip movement image signal", In INTERSPEECH-2007, 718-721.