Microtask platforms such as Amazon Mechanical Turk (AMT) are increasingly used to create speech and language resources. AMT in particular allows researchers to quickly recruit a large number of fairly demographically diverse participants. In this study, we investigated whether AMT can be used for comparing the intelligibility of speech synthesis systems. We conducted two experiments, each run both in the lab and via AMT: one comparing US English diphone synthesis to US English speaker-adaptive HTS synthesis, and one comparing UK English unit selection synthesis to UK English speaker-dependent HTS synthesis. While AMT word error rates were higher than lab error rates, AMT results were more sensitive to relative differences between systems, mainly because AMT yields a larger number of listeners. Boxplots and multilevel modelling allowed us to identify listeners who performed particularly poorly, while simple thresholding was sufficient to eliminate rogue workers. We conclude that AMT is a viable platform for synthetic speech intelligibility comparisons.
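The listener-screening step described in the abstract can be sketched in a few lines. The following Python fragment is illustrative only, not the authors' analysis code: it assumes per-listener word error rates have already been computed, the WER cutoff and listener data are invented for the example, and the multilevel modelling step is not shown. It applies a simple threshold to eliminate rogue workers, then uses the standard boxplot rule (values beyond Q3 + 1.5 × IQR) to flag remaining listeners who performed particularly poorly.

```python
from statistics import quantiles

# Hypothetical per-listener word error rates (fraction of words wrong).
# These values are illustrative only, not data from the paper.
wer_by_listener = {
    "w01": 0.18, "w02": 0.20, "w03": 0.21, "w04": 0.19, "w05": 0.22,
    "w06": 0.23, "w07": 0.20, "w08": 0.24,
    "w09": 0.55,  # unusually poor performer
    "w10": 0.95,  # likely rogue worker (e.g. random clicking)
}

# 1) Simple thresholding to eliminate rogue workers.
#    The cutoff of 0.9 is an assumed value for illustration.
ROGUE_WER = 0.9
rogue = {w for w, wer in wer_by_listener.items() if wer >= ROGUE_WER}

# 2) Boxplot-style outlier rule on the remaining listeners:
#    flag anyone whose WER lies beyond Q3 + 1.5 * IQR.
kept = [wer for w, wer in wer_by_listener.items() if w not in rogue]
q1, _, q3 = quantiles(kept, n=4)
upper_fence = q3 + 1.5 * (q3 - q1)
poor = {w for w, wer in wer_by_listener.items()
        if w not in rogue and wer > upper_fence}

print("rogue workers:", sorted(rogue))     # -> ['w10']
print("poor performers:", sorted(poor))    # -> ['w09']
```

In practice the multilevel model would additionally account for sentence- and system-level variation before attributing poor performance to a listener; the sketch above covers only the thresholding and boxplot screening.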
Index Terms: intelligibility, evaluation, semantically unpredictable sentences, diphone, unit selection, crowdsourcing, Mechanical Turk, HMM-based synthesis
Cite as: Wolters, M.K., Isaac, K.B., Renals, S. (2010) Evaluating speech synthesis intelligibility using Amazon Mechanical Turk. Proc. 7th ISCA Workshop on Speech Synthesis (SSW 7), 136-141
@inproceedings{wolters10_ssw,
  author    = {Maria K. Wolters and Karl B. Isaac and Steve Renals},
  title     = {{Evaluating speech synthesis intelligibility using Amazon Mechanical Turk}},
  year      = 2010,
  booktitle = {Proc. 7th ISCA Workshop on Speech Synthesis (SSW 7)},
  pages     = {136--141}
}