Text-To-Speech (TTS) systems are commonly evaluated along two main dimensions: intelligibility and naturalness. While there are clear proxies for intelligibility measurements such as transcription Word-Error-Rate (WER), naturalness is not nearly so well defined. In this paper, we present the results of our attempt to learn what aspects human listeners consider when they are asked to evaluate the “naturalness” of TTS systems. We conducted a user study similar to common TTS evaluations and at the end asked the subject to define the sense of naturalness that they had used. Then we coded their answers and statistically analysed the distribution of codes to create a list of aspects that users consider as part of naturalness. We can now provide a list of suggested replacement questions to use instead of a single oblique notion of naturalness.
Cite as: Shirali-Shahreza, S., Penn, G. (2023) Better Replacement for TTS Naturalness Evaluation. Proc. 12th ISCA Speech Synthesis Workshop (SSW2023), 197-203, doi: 10.21437/SSW.2023-31
@inproceedings{shiralishahreza23_ssw,
  author={Sajad Shirali-Shahreza and Gerald Penn},
  title={{Better Replacement for TTS Naturalness Evaluation}},
  year=2023,
  booktitle={Proc. 12th ISCA Speech Synthesis Workshop (SSW2023)},
  pages={197--203},
  doi={10.21437/SSW.2023-31}
}