Interspeech'2005 - Eurospeech

Lisbon, Portugal
September 4-8, 2005

Czech Spontaneous Speech Corpus with Structural Metadata

Jáchym Kolár (1), Jan Svec (1), Stephanie Strassel (2), Christopher Walker (2), Dagmar Kozlíková (1), Josef Psutka (1)

(1) University of West Bohemia in Pilsen, Czech Republic; (2) University of Pennsylvania, USA

This paper describes a Czech spontaneous speech corpus consisting of radio talk show recordings. As the first complete non-English MDE (metadata extraction) corpus, it has been annotated with structural metadata: information beyond the words themselves that is critical both for increasing transcript readability and for allowing the application of downstream NLP methods. Metadata annotation involves partitioning verbatim transcripts into syntactic/semantic units (SUs), each of which expresses a complete idea, and identifying fillers and edit disfluencies. The annotation guidelines for English metadata developed by the Linguistic Data Consortium were taken as the starting point, with changes applied to accommodate specific phenomena of Czech. In addition to the necessary language-dependent modifications, we propose some language-independent modifications, including limited prosodic labeling at SU boundaries. Statistics on the structural metadata annotation in the corpus and inter-annotator agreement figures are also presented.


Bibliographic reference. Kolár, Jáchym / Svec, Jan / Strassel, Stephanie / Walker, Christopher / Kozlíková, Dagmar / Psutka, Josef (2005): "Czech spontaneous speech corpus with structural metadata", in INTERSPEECH-2005, 1165-1168.