ISCA Archive SSW 2023

Stuck in the MOS pit: A critical analysis of MOS test methodology in TTS evaluation

Ambika Kirkland, Shivam Mehta, Harm Lameris, Gustav Eje Henter, Eva Szekely, Joakim Gustafson

The Mean Opinion Score (MOS) is a prevalent metric in TTS evaluation. Although standards for collecting and reporting MOS exist, researchers seem to use the term inconsistently, and underreport the details of their testing methodologies. A survey of Interspeech and SSW papers from 2021-2022 shows that most authors do not report scale labels, increments, or instructions to participants, and those who do diverge in terms of their implementation. It is also unclear in many cases whether listeners were asked to rate naturalness, or overall quality. MOS obtained for natural speech using different testing methodologies vary in the surveyed papers: specifically, quality MOS is on average higher than naturalness MOS. We carried out several listening tests using the same stimuli but with differences in the scale increment and instructions about what participants should rate, and found that both of these variables affected MOS for some systems.


doi: 10.21437/SSW.2023-7

Cite as: Kirkland, A., Mehta, S., Lameris, H., Henter, G.E., Szekely, E., Gustafson, J. (2023) Stuck in the MOS pit: A critical analysis of MOS test methodology in TTS evaluation. Proc. 12th ISCA Speech Synthesis Workshop (SSW2023), 41-47, doi: 10.21437/SSW.2023-7

@inproceedings{kirkland23_ssw,
  author={Ambika Kirkland and Shivam Mehta and Harm Lameris and Gustav Eje Henter and Eva Szekely and Joakim Gustafson},
  title={{Stuck in the MOS pit: A critical analysis of MOS test methodology in TTS evaluation}},
  year=2023,
  booktitle={Proc. 12th ISCA Speech Synthesis Workshop (SSW2023)},
  pages={41--47},
  doi={10.21437/SSW.2023-7}
}