The French Technolangue MEDIA-EVALDA project aims to evaluate spoken understanding approaches. This paper describes the semantic annotation scheme of a common dialog corpus that will be used for developing and evaluating spoken understanding models and for linguistic studies. A common semantic representation has been formalized and agreed upon by the consortium. Each utterance is divided into semantic segments, and each segment is annotated with a 5-tuple containing the mode, the attribute name representing the underlying concept, the normalized form of the attribute, a list of related segments, and an optional comment about the annotation. Periodic inter-annotator agreement studies show that the annotations are of good quality, with an agreement of almost 90% on mode and attribute identification. An analysis of the semantic content of 12,292 annotated client utterances shows that only 14.1% of the observed attributes are domain-dependent and that the semantic dictionary ensures good coverage of the task.
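To make the 5-tuple concrete, the following is a minimal Python sketch of one possible in-memory representation of a segment annotation, together with a simple agreement measure on mode and attribute. It is an illustration only, not the project's actual file format or scoring tool. The class and function names, the mode symbols ("+", "-"), the attribute names, and the example values are all hypothetical, and the agreement function assumes both annotators work from the same segmentation.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class SegmentAnnotation:
    """One semantic segment and its 5-tuple (field names are hypothetical)."""
    mode: str                      # e.g. "+" or "-" (assumed symbols)
    attribute: str                 # attribute name for the underlying concept
    value: str                     # normalized form of the attribute
    related: Tuple[int, ...] = ()  # indices of related segments
    comment: Optional[str] = None  # optional comment about the annotation


def mode_attribute_agreement(
    first: Sequence[SegmentAnnotation],
    second: Sequence[SegmentAnnotation],
) -> float:
    """Fraction of aligned segments with identical (mode, attribute) pairs.

    Assumes both sequences cover the same segments in the same order.
    """
    if len(first) != len(second):
        raise ValueError("annotations must cover the same segments")
    if not first:
        raise ValueError("at least one segment is required")
    matches = sum(
        1
        for a, b in zip(first, second)
        if (a.mode, a.attribute) == (b.mode, b.attribute)
    )
    return matches / len(first)


# Two hypothetical annotators labelling the same three segments.
annotator_1 = [
    SegmentAnnotation("+", "command-task", "reservation"),
    SegmentAnnotation("+", "hotel-name", "ibis", related=(0,)),
    SegmentAnnotation("-", "location-city", "paris"),
]
annotator_2 = [
    SegmentAnnotation("+", "command-task", "reservation"),
    SegmentAnnotation("+", "hotel-name", "ibis", related=(0,)),
    SegmentAnnotation("+", "location-city", "paris", comment="mode unclear"),
]

agreement = mode_attribute_agreement(annotator_1, annotator_2)
print(f"mode+attribute agreement: {agreement:.2f}")
```

In this toy example the annotators differ only on the mode of the third segment, so two of three segments agree. The normalized value, related segments, and comment are deliberately ignored, mirroring the abstract's focus on mode and attribute identification.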
Cite as: Bonneau-Maynard, H., Rosset, S., Ayache, C., Kuhn, A., Mostefa, D. (2005) Semantic annotation of the French media dialog corpus. Proc. Interspeech 2005, 3457-3460, doi: 10.21437/Interspeech.2005-312
@inproceedings{bonneaumaynard05_interspeech,
  author={H. Bonneau-Maynard and Sophie Rosset and C. Ayache and A. Kuhn and Djamel Mostefa},
  title={{Semantic annotation of the French media dialog corpus}},
  year=2005,
  booktitle={Proc. Interspeech 2005},
  pages={3457--3460},
  doi={10.21437/Interspeech.2005-312}
}