The aim of this paper is to generate a more comprehensive framework for evaluating synthetic speech. To this end, a line of tests resulting in an exploratory factor analysis (EFA) has been carried out. The proposed dimensions that encapsulate the construct of “synthetic speech quality” are: “human-likeness”, “audio quality”, “negative emotion”, “dominance”, “positive emotion”, “calmness”, “seniority” and “gender”, with item-to-total correlations pointing towards “gender” being an orthogonal construct. A subsequent analysis of common acoustic features, found in forensic and phonetic literature, reveals very weak correlations with the proposed scales. Inter-rater and inter-item agreement measures additionally reveal low consistency within the scales. We also make the case that there is a need for a more fine-grained approach when investigating the quality of synthetic speech systems, and propose a method that attempts to capture individual quality dimensions in the time domain.
Cite as: Seebauer, F., Kuhlmann, M., Haeb-Umbach, R., Wagner, P. (2023) Re-examining the quality dimensions of synthetic speech. Proc. 12th ISCA Speech Synthesis Workshop (SSW2023), 34-40, doi: 10.21437/SSW.2023-6
@inproceedings{seebauer23_ssw,
  author={Fritz Seebauer and Michael Kuhlmann and Reinhold Haeb-Umbach and Petra Wagner},
  title={{Re-examining the quality dimensions of synthetic speech}},
  year=2023,
  booktitle={Proc. 12th ISCA Speech Synthesis Workshop (SSW2023)},
  pages={34--40},
  doi={10.21437/SSW.2023-6}
}