Re-examining the quality dimensions of synthetic speech

Fritz Seebauer, Michael Kuhlmann, Reinhold Haeb-Umbach, Petra Wagner

The aim of this paper is to generate a more comprehensive framework for evaluating synthetic speech. To this end, a line of tests resulting in an exploratory factor analysis (EFA) has been carried out. The proposed dimensions that encapsulate the construct of “synthetic speech quality” are: “human-likeness”, “audio quality”, “negative emotion”, “dominance”, “positive emotion”, “calmness”, “seniority” and “gender”, with item-to-total correlations pointing towards “gender” being an orthogonal construct. A subsequent analysis of common acoustic features found in the forensic and phonetic literature reveals very weak correlations with the proposed scales. Inter-rater and inter-item agreement measures additionally reveal low consistency within the scales. We also make the case that there is a need for a more fine-grained approach when investigating the quality of synthetic speech systems, and propose a method that attempts to capture individual quality dimensions in the time domain.

doi: 10.21437/SSW.2023-6

