Bias and Statistical Significance in Evaluating Speech Synthesis with Mean Opinion Scores

Andrew Rosenberg, Bhuvana Ramabhadran


Listening tests and Mean Opinion Scores (MOS) are the most commonly used techniques for evaluating the quality and naturalness of speech synthesis. They are invaluable for assessing subjective qualities of machine-generated stimuli. However, there are a number of challenges in interpreting the MOS scores that come out of listening tests.

Primarily, we advocate for the use of non-parametric statistical tests in the calculation of statistical significance when comparing listening test results.
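The abstract does not say which non-parametric tests the paper recommends. As one illustration of the general idea, the following is a minimal sketch of an exact paired sign-flip permutation test, which makes no normality assumption about the ratings. The function name `paired_permutation_test` and the example ratings are hypothetical and are not taken from the paper.

```python
from itertools import product


def paired_permutation_test(scores_a, scores_b):
    """Two-sided exact sign-flip permutation test on paired ratings.

    scores_a[i] and scores_b[i] are assumed to be ratings of the same
    item (or by the same rater) for systems A and B. Under the null
    hypothesis the sign of each paired difference is arbitrary, so every
    sign assignment is enumerated and the summed differences compared.
    Enumeration is exponential in the number of pairs, so this is only
    practical for small samples.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired test needs equally long score lists")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed:
            as_extreme += 1
        total += 1
    return as_extreme / total


# Hypothetical ratings on a 5-point scale, for illustration only.
system_a = [4, 5, 4, 4, 5, 4, 5, 4]
system_b = [3, 4, 3, 3, 4, 3, 4, 3]
p_value = paired_permutation_test(system_a, system_b)
print(f"p = {p_value:.4f}")
```

For larger listening tests, sampling random sign assignments instead of enumerating all of them keeps the same logic at a manageable cost.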

Additionally, based on the results of 46 legacy listening tests, we measure the impact of two sources of bias. Bias introduced by individual participants and by the synthesized text can have a dramatic impact on observed MOS scores. For example, we find that on average the mean difference between the highest and lowest scoring rater is over 2 MOS points (on a 5-point scale). From this observation, we caution against using any statistical test without adjusting for this bias, and provide specific non-parametric recommendations.
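To make the rater-bias measurement concrete, the sketch below computes each rater's mean score, the gap between the highest and lowest scoring rater, and a rater-centered version of the ratings. Mean-centering is an assumption chosen for illustration; the abstract does not state how the paper adjusts for bias. The `(rater, system, score)` tuple layout, the function names, and the data are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def rater_means(ratings):
    """Return the mean score given by each rater.

    ratings is an iterable of (rater, system, score) tuples.
    """
    by_rater = defaultdict(list)
    for rater, _system, score in ratings:
        by_rater[rater].append(score)
    return {rater: mean(scores) for rater, scores in by_rater.items()}


def rater_range(ratings):
    """Gap in MOS points between the highest and lowest scoring rater."""
    means = rater_means(ratings)
    return max(means.values()) - min(means.values())


def center_by_rater(ratings):
    """Subtract each rater's own mean from that rater's scores.

    This is one simple adjustment (an assumption, not the paper's
    stated method) that removes a constant per-rater offset.
    """
    means = rater_means(ratings)
    return [(rater, system, score - means[rater])
            for rater, system, score in ratings]


# Hypothetical ratings, for illustration only.
ratings = [
    ("r1", "A", 5), ("r1", "B", 4),
    ("r2", "A", 3), ("r2", "B", 2),
]
print(rater_range(ratings))
print(center_by_rater(ratings))
```

In this toy data both raters prefer system A by one point, but their raw scores differ by two points on average; after centering, their adjusted scores agree.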


DOI: 10.21437/Interspeech.2017-479

Cite as: Rosenberg, A., Ramabhadran, B. (2017) Bias and Statistical Significance in Evaluating Speech Synthesis with Mean Opinion Scores. Proc. Interspeech 2017, 3976-3980, DOI: 10.21437/Interspeech.2017-479.


@inproceedings{Rosenberg2017,
  author={Andrew Rosenberg and Bhuvana Ramabhadran},
  title={Bias and Statistical Significance in Evaluating Speech Synthesis with Mean Opinion Scores},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={3976--3980},
  doi={10.21437/Interspeech.2017-479},
  url={http://dx.doi.org/10.21437/Interspeech.2017-479}
}