12th Annual Conference of the International Speech Communication Association

Florence, Italy
August 27-31, 2011

On Development of Consistently Punctuated Speech Corpora

Jáchym Kolář, Lori Lamel

LIMSI, France

Punctuating automatically recognized speech improves the readability of transcripts and aids downstream NLP. This paper addresses issues in developing training and test corpora for automatic punctuation systems. Punctuation annotation of speech transcripts is difficult because there are numerous cases for which no standard punctuation rules exist. To address this, special punctuation annotation guidelines tailored to spoken language were developed. Using these guidelines, trained annotators have punctuated almost 100 hours of broadcast news and conversation data in English and French. Measures of inter-annotator agreement are provided for both languages, and differences between languages and genres are analyzed and discussed, along with some of the most frequent disagreements between annotators. Overall, using the guidelines significantly improved annotation consistency.

Bibliographic reference. Kolář, Jáchym / Lamel, Lori (2011): "On development of consistently punctuated speech corpora", In INTERSPEECH-2011, 833-836.