INTERSPEECH 2004 - ICSLP
This paper describes a novel speech-interface function, called "speech spotter", which enables a user to enter voice commands into a speech recognizer in the midst of natural human-human conversation. Automatic speech recognition has been difficult to use in human-human conversation because it is hard to judge, from microphone input alone, whether a user is speaking to another person or to a speech recognizer. We solve this problem by using two kinds of nonverbal speech information: a filled pause (a vowel-lengthening hesitation like "er...") and voice pitch. A voice command is accepted by the speech recognizer only when the user utters it with a high pitch just after a filled pause. Using this speech-spotter function, we have built two application systems: an on-demand information system for assisting human-human conversation and a music-playback system for enriching telephone conversation. Experience with these systems has shown that the speech-spotter function is robust and convenient enough to be used in face-to-face or cellular-phone conversations.
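The gating rule in the abstract — accept a command only if it is spoken with a high pitch immediately after a filled pause — can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual detector: the frame representation (per-frame F0 values, 0 for unvoiced), the pitch-flatness test for the filled pause, and all thresholds are assumptions introduced here.

```python
import math

# Hypothetical sketch of the speech-spotter gating rule: a filled pause
# ("er...") is modeled as a sustained stretch of nearly flat pitch, and a
# command is accepted only if its onset pitch is clearly higher than the
# pause pitch. Frame counts and thresholds are illustrative assumptions.

def is_filled_pause(f0_frames, max_range_semitones=1.0, min_frames=30):
    """True if the frames look like a lengthened, flat-pitch vowel."""
    voiced = [f for f in f0_frames if f > 0]  # drop unvoiced frames (F0 == 0)
    if len(voiced) < min_frames:
        return False
    # Pitch range in semitones relative to the first voiced frame.
    cents = [12 * math.log2(f / voiced[0]) for f in voiced]
    return max(cents) - min(cents) <= max_range_semitones

def accept_command(pause_frames, command_frames, high_pitch_ratio=1.2):
    """Accept only if a filled pause precedes a clearly higher-pitched onset."""
    if not is_filled_pause(pause_frames):
        return False
    pause_voiced = [f for f in pause_frames if f > 0]
    cmd_voiced = [f for f in command_frames if f > 0]
    if not pause_voiced or not cmd_voiced:
        return False
    pause_mean = sum(pause_voiced) / len(pause_voiced)
    onset = cmd_voiced[:10]                     # first voiced frames of the command
    onset_mean = sum(onset) / len(onset)
    return onset_mean >= high_pitch_ratio * pause_mean
```

With this sketch, a flat 120 Hz hesitation followed by a command starting near 160 Hz would be accepted, while the same command after ordinary (pitch-varying) speech, or a command at the same pitch as the pause, would be rejected.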
Bibliographic reference. Goto, Masataka / Kitayama, Koji / Itou, Katsunobu / Kobayashi, Tetsunori (2004): "Speech spotter: on-demand speech recognition in human-human conversation on the telephone or in face-to-face situations", In INTERSPEECH-2004, 1533-1536.