User quality judgements can show a bewildering amount of variation that is difficult to capture using traditional quality prediction approaches. Using clustering, an exploratory statistical analysis technique, we reanalysed the data set of a Wizard-of-Oz experiment in which 25 users were asked to rate the dialogue after each turn. The sparse data problem was addressed through careful a priori parameter choices and by comparing the results of different clustering algorithms. We found two distinct classes of users: positive and critical. Positive users were generally happy with the dialogue system and did not mind errors. Critical users downgraded their opinion of the system after errors, used a wider range of ratings, and were less likely to rate the system positively overall. These user groups could not be predicted by experience with spoken dialogue systems, attitude to spoken dialogue systems, affinity with technology, demographics, or short-term memory capacity. We suggest that evaluation research should focus on critical users and discuss how such users might be identified.
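The abstract does not state which clustering algorithms, features, or parameters the authors used. As a rough illustration only, the sketch below shows one way per-user rating behaviour could be summarised and clustered into two groups with two common algorithms (k-means and agglomerative clustering), with the resulting partitions compared. The feature definitions, the hypothetical error-turn positions, and all parameter choices are assumptions for demonstration, not taken from the paper.

```python
# Illustrative sketch only: clustering users by turn-level quality ratings
# and comparing two algorithms' partitions. Features, error-turn positions,
# and parameters are assumptions, not the paper's actual setup.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for 25 users' turn-level ratings (e.g. 20 turns, 1-5 scale).
ratings = rng.integers(1, 6, size=(25, 20)).astype(float)

# Per-user summary features: overall satisfaction, rating spread, and the
# rating change on turns we (hypothetically) mark as containing system errors.
error_turns = [4, 9, 14]  # hypothetical error positions
features = np.column_stack([
    ratings.mean(axis=1),                                          # overall level
    ratings.std(axis=1),                                           # range of ratings used
    ratings[:, error_turns].mean(axis=1) - ratings.mean(axis=1),   # error sensitivity
])
X = StandardScaler().fit_transform(features)

# A priori choice of two clusters, compared across two algorithms.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
agglo_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

# Agreement between the two partitions (1.0 = identical up to relabelling).
print("Adjusted Rand index:", adjusted_rand_score(kmeans_labels, agglo_labels))
```

With real data, high agreement between algorithms would lend some support to a two-cluster (positive vs. critical) interpretation; the synthetic ratings here exist only to make the sketch runnable.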
Cite as: Wolters, M.K., Gödde, F., Möller, S., Engelbrecht, K.-P. (2010) Finding Patterns in User Quality Judgements. Proc. 3rd International Workshop on Perceptual Quality of Systems (PQS 2010), 41-46, doi: 10.21437/PQS.2010-8
@inproceedings{wolters10_pqs,
  author={Maria K. Wolters and Florian Gödde and Sebastian Möller and Klaus-Peter Engelbrecht},
  title={{Finding Patterns in User Quality Judgements}},
  year=2010,
  booktitle={Proc. 3rd International Workshop on Perceptual Quality of Systems (PQS 2010)},
  pages={41--46},
  doi={10.21437/PQS.2010-8}
}