ISCA Archive SSW 2021

Factors Affecting the Evaluation of Synthetic Speech in Context

Johannah O'Mahony, Pilar Oplustil-Gallegos, Catherine Lai, Simon King

Realizing text-to-speech (TTS) systems for dialects is useful for personalizing TTS. However, TTS for many dialects of pitch-accent languages has not been realized because of the low-resource problem. Among the many dialects of pitch-accent languages, this paper focuses on the Osaka dialect of Japanese, one of the most challenging pitch-accent languages. For Japanese TTS systems, accent labels are known to be necessary as input to synthesize natural speech. For rich-resourced dialects, human-resourced and dictionary-based approaches are often used to annotate accent labels for training and inference, but such approaches are infeasible and time-consuming for low-resourced dialects. In this paper, we propose an accent extraction model that uses a vector quantized variational autoencoder (VQ-VAE) to obtain accent information from speech, and accent prediction models that use decision trees and deep learning techniques to predict accent information from the input text. The models were evaluated on a corpus of the Osaka dialect, for which no accent labels exist. The results showed that the accent extraction model succeeded in extracting accent information of the Osaka dialect from speech utterances as a latent variable. They also showed that the accent of speech synthesized with the accent prediction models was not better than the baseline, but the approach had advantages such as interpretability.

doi: 10.21437/SSW.2021-26

Cite as: O'Mahony, J., Oplustil-Gallegos, P., Lai, C., King, S. (2021) Factors Affecting the Evaluation of Synthetic Speech in Context. Proc. 11th ISCA Speech Synthesis Workshop (SSW 11), 148-153, doi: 10.21437/SSW.2021-26

  author={Johannah O'Mahony and Pilar Oplustil-Gallegos and Catherine Lai and Simon King},
  title={{Factors Affecting the Evaluation of Synthetic Speech in Context}},
  booktitle={Proc. 11th ISCA Speech Synthesis Workshop (SSW 11)},