EUROSPEECH 2003 - INTERSPEECH 2003
Spontaneously spoken utterances are characterized by a number of lexical and non-lexical features. These features can also reflect speaker-specific characteristics. A major factor that distinguishes spontaneous speech from written text is the presence of paralinguistic features such as filled pauses (fillers), false starts, laughter, disfluencies and discourse markers that lie beyond the framework of formal grammars. The speech recognition community has dealt with this variability by making provisions for it in language models, to improve recognition accuracy for spoken language. In another scenario, the analysis of these features could also be used in language processing and generation to improve synthesized speech or machine responses. Such synthesized spontaneous speech could be used for computer avatars and Speech User Interfaces (SUIs), where lengthy interactions with machines occur and it is generally desirable to mimic a particular speaker or speaking style. This language generation problem involves capturing both the general characteristics of spontaneous speech and speaker-specific traits. The usefulness of conventional language processing tools is limited by the availability of training corpora. Hence, an empirical text processing technique, with ideas motivated by psycholinguistics, is proposed. Such a technique could be included in the text analysis stage of a text-to-speech (TTS) system. The proposed technique is adaptable: it can be extended to mimic different speakers based on an individual's speaking style and filler preferences.
Bibliographic reference. Sundaram, Shiva / Narayanan, Shrikanth (2003): "An empirical text transformation method for spontaneous speech synthesizers", In EUROSPEECH-2003, 1221-1224.