Speech Prosody 2004

Nara, Japan
March 23-26, 2004

Direct Modeling of Prosody: An Overview of Applications in Automatic Speech Processing

Elizabeth Shriberg, Andreas Stolcke

Speech Technology and Research Laboratory, SRI International, Menlo Park, CA, USA; and International Computer Science Institute, Berkeley, CA, USA

We describe a "direct modeling" approach to using prosody in various speech technology tasks. The approach does not involve any hand-labeling or modeling of prosodic events such as pitch accents or boundary tones. Instead, prosodic features are extracted directly from the speech signal and from the output of an automatic speech recognizer. Machine learning techniques then determine a prosodic model, which is integrated with lexical and other information to predict the target classes of interest. We discuss task-specific modeling and results for a line of research covering four general application areas: (1) structural tagging (finding sentence boundaries, disfluencies), (2) pragmatic and paralinguistic tagging (classifying dialog acts, emotion, and "hot spots"), (3) speaker recognition, and (4) word recognition itself. To provide an idea of performance on real-world data, we focus on spontaneous (rather than read or acted) speech from a variety of contexts, including human-human telephone conversations, game-playing, human-computer dialog, and multi-party meetings.
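
As a rough illustration of the general idea of integrating a prosodic model with lexical information, the following minimal sketch combines two posterior distributions over the same target classes (here, sentence boundary vs. no boundary) by log-linear interpolation. It is not the authors' implementation: the feature names (pause_sec, f0_drop_semitones), the hand-set weights standing in for a trained classifier, the interpolation weight, and the lexical posterior values are all hypothetical assumptions chosen for the example.

```python
import math


def prosodic_posterior(pause_sec, f0_drop_semitones):
    """Toy prosodic model: a logistic score over two prosodic features.

    The weights below are hypothetical, hand-set values standing in for
    parameters that would normally be learned from data.
    """
    z = -2.0 + 4.0 * pause_sec + 0.3 * f0_drop_semitones
    p = 1.0 / (1.0 + math.exp(-z))
    return {"boundary": p, "no_boundary": 1.0 - p}


def combine_posteriors(p_prosody, p_lexical, weight=0.5):
    """Log-linear interpolation of two posteriors over the same classes.

    weight = 1.0 uses only the prosodic model; weight = 0.0 uses only
    the lexical model. The result is renormalized to sum to 1.
    """
    floor = 1e-12  # avoid log(0)
    scores = {
        c: weight * math.log(max(p_prosody[c], floor))
        + (1.0 - weight) * math.log(max(p_lexical[c], floor))
        for c in p_prosody
    }
    top = max(scores.values())  # subtract the max for numerical stability
    unnorm = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}


if __name__ == "__main__":
    # Hypothetical inputs: a long pause with a pitch drop, plus a lexical
    # (e.g. language-model) posterior for the same word boundary.
    pros = prosodic_posterior(pause_sec=0.8, f0_drop_semitones=3.0)
    lex = {"boundary": 0.6, "no_boundary": 0.4}
    print(combine_posteriors(pros, lex, weight=0.5))
```

In this sketch the interpolation weight would typically be tuned on held-out data; other combination schemes are equally possible and the abstract does not specify which one is used for each task.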

Bibliographic reference. Shriberg, Elizabeth / Stolcke, Andreas (2004): "Direct modeling of prosody: an overview of applications in automatic speech processing", in SP-2004, 575-582.