EUROSPEECH 2003 - INTERSPEECH 2003
This paper describes a semantic annotation scheme for spoken dialog corpora. Manual semantic annotation of large corpora is tedious, expensive, and prone to inconsistency. Consistency is necessary if a corpus is to be useful for developing and evaluating spoken language understanding models and for linguistic studies. A semantic representation, based on a concept dictionary definition, has been formalized and is described. Each utterance is divided into semantic segments, and each segment is assigned a 5-tuple containing a mode, the underlying concept, the normalized form of the concept, the list of related segments, and an optional comment about the annotation. Based on this scheme, a tool was developed which ensures that the annotations provided respect the semantic representation. The tool includes interfaces for both the formal definition of the hierarchical concept dictionary and the annotation process. An experiment was conducted to assess inter-annotator agreement using both a human-human dialog corpus and a human-machine dialog corpus. For the human-human dialogs, the agreement rate computed on the (mode, concept, value) triplets is 61%, and the agreement rate on the concepts alone is 74%. For the human-machine dialogs, the agreement rate on the triplets is 83% and the correct concept identification rate is 93%.
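As an illustration only, the 5-tuple attached to each segment and the two agreement rates can be sketched in Python. This is not the authors' tool. The class name `SegmentAnnotation`, the function `agreement_rate`, the mode labels, and the concept names are all hypothetical, and the sketch assumes that both annotators worked on the same segmentation, which the abstract does not state.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class SegmentAnnotation:
    """One semantic segment annotated with the 5-tuple from the abstract."""
    mode: str                       # hypothetical labels, e.g. "affirmative"
    concept: str                    # entry from the concept dictionary
    value: str                      # normalized form of the concept
    related: Tuple[int, ...] = ()   # indices of related segments
    comment: Optional[str] = None   # optional note about the annotation

    def triplet(self) -> Tuple[str, str, str]:
        return (self.mode, self.concept, self.value)


def agreement_rate(
    a: Sequence[SegmentAnnotation],
    b: Sequence[SegmentAnnotation],
    key: Callable[[SegmentAnnotation], Hashable],
) -> float:
    """Fraction of aligned segments on which two annotators agree under `key`.

    Assumption: both sequences cover the same segments in the same order.
    """
    if len(a) != len(b):
        raise ValueError("annotations must cover the same segments")
    if not a:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if key(x) == key(y))
    return matches / len(a)


# Invented example data: two annotators, four segments of one utterance.
annotator_a = [
    SegmentAnnotation("affirmative", "command-task", "reservation"),
    SegmentAnnotation("affirmative", "location-city", "paris", related=(0,)),
    SegmentAnnotation("negative", "time-date", "monday"),
    SegmentAnnotation("affirmative", "hotel-name", "ibis"),
]
annotator_b = [
    SegmentAnnotation("affirmative", "command-task", "reservation"),
    SegmentAnnotation("affirmative", "location-city", "paris"),
    SegmentAnnotation("affirmative", "time-date", "monday"),
    SegmentAnnotation("affirmative", "location-city", "ibis"),
]

concept_rate = agreement_rate(annotator_a, annotator_b, lambda s: s.concept)
triplet_rate = agreement_rate(annotator_a, annotator_b, lambda s: s.triplet())
print(concept_rate)  # 0.75
print(triplet_rate)  # 0.5
```

In this sketch the related-segment list and the comment are deliberately left out of the comparison, mirroring the abstract's choice to report agreement on the triplets and on the concepts alone.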
Bibliographic reference. Bonneau-Maynard, Hélène / Rosset, Sophie (2003): "A semantic representation for spoken dialogs", In EUROSPEECH-2003, 253-256.