ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition
April 13-16, 2003
This paper presents the annotation and statistical analysis of spontaneous speech events in a series of broadcast news interviews drawn from the so-called Corpus Oral de Referencia de la Lengua Española Contemporánea. The annotated corpus consists of 42 fully transcribed interviews taken from radio and television broadcasts, lasting 6.41 hours in total. The corpus is intended primarily to compare the frequencies and typologies of spontaneous speech events in task-specific and generic speech, but also to train acoustic and language models and to carry out recognition experiments. The annotation process involved two steps: (1) filtering the initial transcriptions, and (2) augmenting the filtered transcriptions with acoustic and lexical events. Filtering served not only to adapt the orthographic conventions and the mark-up format but also to discard marks that were irrelevant from the point of view of speech recognition. Besides human and non-human noises, the annotation covered acoustic events (lengthenings, silent pauses and filled pauses), lexical events (cut-off words, mispronunciations and guttural affirmations), and speech overlaps, which rarely appear in human-computer dialogues. Statistics show that the probability of finding such an event at any given word is 0.19.
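As a rough illustration of how a per-word event probability like the one reported above could be estimated, the following minimal sketch counts annotated events and divides by the number of word tokens. The inline tag names, the `event_rate` function and the sample utterance are all hypothetical and are not taken from the paper; the actual mark-up format and the exact definition of the statistic may differ.

```python
import re

# Hypothetical inline tags, loosely following the event types named in the
# abstract. The real mark-up format of the corpus is not specified here.
EVENT_TAG = re.compile(
    r"<(noise|length|pause|filled|cutoff|mispron|affirm|overlap)>"
)


def event_rate(transcript: str) -> float:
    """Return the number of annotated events per word token.

    Assumes whitespace-separated tokens, where each event is marked by a
    standalone tag such as "<pause>" (an assumption for illustration).
    """
    tokens = transcript.split()
    n_events = sum(1 for t in tokens if EVENT_TAG.fullmatch(t))
    n_words = len(tokens) - n_events
    return n_events / n_words if n_words else 0.0


# Invented sample utterance: 9 word tokens and 3 event tags.
sample = (
    "bueno <filled> yo creo que <pause> la situa- <cutoff> "
    "situación es difícil"
)
rate = event_rate(sample)
```

Under this reading, a value of 0.19 would correspond to roughly one annotated event for every five words.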
Bibliographic reference. Rodríguez, L. J. / Torres, I. (2003): "Annotation and analysis of acoustic and lexical events in a generic corpus of spontaneous speech in Spanish", in SSPR-2003, paper TAP5.