AVSP 2003 - International Conference on Audio-Visual Speech Processing
September 4-7, 2003
This paper presents improved recognition of three simultaneous speech signals by a humanoid robot equipped with a pair of microphones. In such situations, sound separation and automatic speech recognition (ASR) of the separated speech are difficult because the number of simultaneous talkers exceeds the number of microphones, the signal-to-noise ratio is quite low (around -3 dB), and the noise is non-stationary due to the interfering voices. To improve the recognition of three simultaneous speech signals, two key ideas are introduced: acoustical modeling of the robot head by scattering theory, and two-layered audio-visual integration of both name and location information, that is, speech and face recognition, and speech and face localization. Sound sources are separated in real time by an active direction-pass filter (ADPF), which extracts sounds from a specified direction by using the interaural phase/intensity differences estimated by scattering theory. Since the features of sounds separated by the ADPF vary according to the sound direction, multiple direction- and speaker-dependent (DS-dependent) acoustic models are used. The system integrates the ASR results by using the sound direction, the speaker identity obtained by face recognition, and the confidence measures of the ASR results to select the best result. The resulting system shows an improvement of around 10% on average in the recognition of three simultaneous speech signals, where the three talkers were located 1 meter from the humanoid and separated from each other by 0 to 90 degrees at 10-degree intervals.
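The core idea of a direction-pass filter, selecting spectral bins whose interaural phase difference (IPD) matches the IPD expected for a chosen direction, can be sketched as follows. This is a minimal single-frame illustration using a free-field plane-wave model with hypothetical parameters (microphone spacing `mic_dist`, tolerance `tol`); the paper's ADPF instead predicts the expected IPD/IID from scattering theory on the robot head, which accounts for diffraction around it.

```python
import numpy as np

def direction_pass_filter(left, right, sr, theta_deg,
                          mic_dist=0.15, c=343.0, tol=0.4):
    """Keep spectral bins whose observed IPD is close to the IPD expected
    for a source at theta_deg (0 = front, positive = left).
    Free-field simplification of the ADPF idea; parameters are illustrative."""
    n = len(left)
    win = np.hanning(n)
    L = np.fft.rfft(left * win)
    R = np.fft.rfft(right * win)
    freqs = np.fft.rfftfreq(n, 1.0 / sr)
    # Expected IPD for a plane wave from theta_deg: 2*pi*f*ITD
    tau = mic_dist * np.sin(np.radians(theta_deg)) / c
    expected = 2.0 * np.pi * freqs * tau
    # Wrapped deviation of the observed IPD from the expected IPD
    ipd = np.angle(L * np.conj(R))
    dev = np.abs(np.angle(np.exp(1j * (ipd - expected))))
    mask = dev < tol
    # Pass only bins consistent with the target direction
    return np.fft.irfft(L * mask, n)
```

For example, a 500 Hz tone arriving from 30 degrees (the right channel delayed by the corresponding interaural time difference) is passed almost unattenuated when the filter is steered to 30 degrees, but largely suppressed when steered to -30 degrees.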
Bibliographic reference. Nakadai, Kazuhiro / Matsuura, Daisuke / Okuno, Hiroshi G. / Tsujino, Hiroshi (2003): "Improvement of three simultaneous speech recognition by using av integration and scattering theory for humanoid", In AVSP 2003, 157-162.