7th International Conference on Spoken Language Processing
September 16-20, 2002
An evaluation of the reliability of the ITU-T P.85 recommended standard for the evaluation of voice output systems was conducted using six English TTS systems. The P.85 standard is based on mean-opinion-score judgements made by a listening panel on a number of rating scales. The study looked at how the ranking of the six systems on these scales varied across four different text genres and across two listening sessions. Rankings were also compared with those from a much simpler pair-comparison test, again across genres and listening sessions. For the ITU test, a high degree of correlation was found across scales, implying that the scales were not really testing different aspects of the systems. Results were surprisingly similar across sessions, implying that listeners were indeed making real judgements. By comparison, the pair-comparison test gave (almost) identical rankings of the systems but with far less variability, making statistically significant comparisons between systems possible, even across genres.
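As a rough illustration of the kind of analysis described above, the following minimal sketch computes mean opinion scores on two rating scales, ranks the systems on each, and measures how well the two rankings agree using Spearman's rank correlation. The system names, scale names, ratings, and helper functions are all hypothetical and are not taken from the study; the abstract does not state which correlation measure the authors used.

```python
from statistics import mean


def mean_opinion_scores(ratings):
    """Map each system to the mean of its listener ratings on one scale."""
    return {system: mean(scores) for system, scores in ratings.items()}


def rank_systems(mos):
    """Return systems ordered best-first by MOS (ties broken by name)."""
    return sorted(mos, key=lambda s: (-mos[s], s))


def spearman_rho(order_a, order_b):
    """Spearman rank correlation between two tie-free rankings of the same items."""
    n = len(order_a)
    pos_b = {system: i for i, system in enumerate(order_b)}
    d2 = sum((i - pos_b[system]) ** 2 for i, system in enumerate(order_a))
    return 1 - 6 * d2 / (n * (n**2 - 1))


# Hypothetical 1-5 ratings from three listeners for three systems
# on two made-up scales; these are NOT data from the study.
scale_1 = {"A": [4, 5, 4], "B": [3, 3, 4], "C": [2, 1, 2]}
scale_2 = {"A": [4, 4, 5], "B": [2, 3, 2], "C": [3, 3, 3]}

rank_1 = rank_systems(mean_opinion_scores(scale_1))
rank_2 = rank_systems(mean_opinion_scores(scale_2))
rho = spearman_rho(rank_1, rank_2)

print(rank_1, rank_2, rho)
```

A value of rho close to 1 across many pairs of scales would be consistent with the scales not measuring distinct aspects of a system.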
Bibliographic reference. Alvarez, Yolanda Vazquez / Huckvale, Mark (2002): "The reliability of the ITU-T P.85 standard for the evaluation of text-to-speech systems", In ICSLP-2002, 329-332.