ISCA Archive SpeechProsody 2022

Naturalness and Intelligibility Monitoring for Text-to-Speech Evaluation

Alex Peiró-Lilja, Guillermo Cámbara, Mireia Farrús, Jordi Luque

Current text-to-speech (TTS) systems are deep learning-based models capable of learning phonetic articulation and intelligibility, as well as the prosodic attributes that model speaking style and give synthetic voices their naturalness. However, the performance of these models depends heavily on their training hyper-parameters and number of iterations. Moreover, a conventional loss function does not reflect how well the voice is actually modeled; we therefore believe that a dedicated training assessment for TTS is needed. To this end, we monitor intelligibility and naturalness during the training of a Tacotron2 model in a two-step process. First, we analyze a method for tracking the intelligibility of the TTS in terms of character-level token error rate (TER), using five different automatic speech recognition (ASR) systems. Second, we extend this work with a recently published TTS naturalness predictor that estimates naturalness in terms of mean opinion scores (MOS). Finally, we combine the predicted MOS with the TER measurements to return, for each training checkpoint, a single score that we name the Full Assessment Score (FAS). We report a clear preference among our listeners for the checkpoint with maximum FAS over the one with minimum validation loss, in both intelligibility and naturalness (up to 62.3% in the latter).
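As an illustration of the intelligibility measure mentioned above, the following is a minimal sketch of a character-level token error rate, assuming the common definition of Levenshtein edit distance between the reference text and an ASR transcript of the synthesized audio, divided by the reference length. The function names `edit_distance` and `char_ter` are hypothetical and are not taken from the paper. The abstract does not state how TER and predicted MOS are combined into the FAS, so that step is not sketched here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of r
                curr[j - 1] + 1,     # insertion of h
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def char_ter(reference, hypothesis):
    """Character-level token error rate: edit distance over reference length.

    Assumption: tokens are single characters and no text normalization
    (casing, punctuation) is applied beyond what the caller provides.
    """
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

In a monitoring loop, `char_ter` would be evaluated on the transcript that each ASR system produces for audio synthesized at a given checkpoint; how the per-system values are aggregated is left to the caller.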

doi: 10.21437/SpeechProsody.2022-91

Cite as: Peiró-Lilja, A., Cámbara, G., Farrús, M., Luque, J. (2022) Naturalness and Intelligibility Monitoring for Text-to-Speech Evaluation. Proc. Speech Prosody 2022, 445-449, doi: 10.21437/SpeechProsody.2022-91

  author={Alex Peiró-Lilja and Guillermo Cámbara and Mireia Farrús and Jordi Luque},
  title={{Naturalness and Intelligibility Monitoring for Text-to-Speech Evaluation}},
  booktitle={Proc. Speech Prosody 2022},