This paper describes a framework for evaluating spoken dialogue systems. Typically, evaluation of dialogue systems is performed in a controlled test environment with carefully selected and instructed users. However, this approach is demanding in both time and cost. An alternative is to recruit a large group of users who evaluate the dialogue systems remotely under virtually no supervision. Crowdsourcing technology, for example Amazon Mechanical Turk (AMT), provides an efficient way of recruiting such subjects. This paper describes an evaluation framework for spoken dialogue systems using AMT users and compares the results obtained with those of a recent trial in which the systems were tested by locally recruited users. The results suggest that the use of crowdsourcing technology is feasible and can provide reliable results.
Bibliographic reference. Jurčíček, F. / Keizer, S. / Gašić, M. / Mairesse, F. / Thomson, B. / Yu, K. / Young, S. (2011): "Real user evaluation of spoken dialogue systems using Amazon Mechanical Turk", In INTERSPEECH-2011, 3061-3064.