Recent advances in speech recognition and understanding have led to the development of spoken language systems: computer applications in which voice input and output are used to accomplish some task. These systems are typically dialog based and contain many natural language understanding components; they may also include databases, reasoning systems, goal and plan detection, inferencing systems, and artificially intelligent planners. Unlike speech recognition systems, where evaluation metrics are well established, spoken language systems contain many different types of components that interact with the speech recognizer. Given such complex systems, we are now at a point where we must develop metrics for evaluating them.

The purpose of evaluation is to measure progress, compare different systems, and discern their relative strengths and weaknesses. Evaluation metrics are extremely important; however, we must take care in developing them so that they neither restrict the possible implementations, heuristics, or algorithms that could be employed in a spoken language system, nor require that systems employ particular types of representations.

In this paper I first discuss evaluation metrics that may enable us to assess both the relative contributions of the different components which may be included in a spoken language system and methods for evaluating spoken language systems as wholes against one another. These measures must enable systems with different components to be evaluated against one another. Second, I discuss methods for evaluating more local phenomena, that is, metrics for assessing more granular and specific linguistic capabilities. Examples of such capabilities are indirect speech acts, coreference and reference determination, the ability to represent and assess the state of the world, and the ability to infer goals and plans and to reason from available information.
Finally, I discuss the issue of language models, or grammars. The tradeoff between grammatical coverage and overgeneralization has recently been an area of much discussion, and it appears that the choice of language model interacts with the evaluation metrics for both global and local phenomena.
Cite as: Young, S.R. (1989) Evaluation techniques for spoken language systems. Proc. Speech Input/Output Assessment and Speech Databases, Vol.2, 219-222
@inproceedings{young89_sioa,
  author    = {Sheryl R. Young},
  title     = {{Evaluation techniques for spoken language systems}},
  year      = {1989},
  booktitle = {Proc. Speech Input/Output Assessment and Speech Databases},
  volume    = {2},
  pages     = {219--222}
}