ISCA Archive SSW 2021

Accent Modeling of Low-Resourced Dialect in Pitch Accent Language Using Variational Autoencoder

Kazuya Yufune, Tomoki Koriyama, Shinnosuke Takamichi, Hiroshi Saruwatari

In this paper, we propose an emotion-controllable text-to-speech (TTS) model that allows both emotion-level (i.e., coarse-grained) control and prosodic-feature-level (i.e., fine-grained) control of speech using both emotional soft-labels and prosodic features. Conventional methods control speech using only emotional labels or prosodic features (e.g., mean and standard deviation of pitch), which cannot express diverse emotions. Our model is based on a prosodic feature generator that decodes emotion soft-labels into prosodic features. It allows the emotion of synthetic speech to be controlled by both emotion labels and prosodic features. The experimental results show that 1) the emotion-perceptual accuracy of synthetic speech reaches 66%, and 2) the mean opinion score for the naturalness of emotionally controlled synthetic speech is 3.5, which is comparable to that of the conventional method using prosodic features.

doi: 10.21437/SSW.2021-33

Cite as: Yufune, K., Koriyama, T., Takamichi, S., Saruwatari, H. (2021) Accent Modeling of Low-Resourced Dialect in Pitch Accent Language Using Variational Autoencoder. Proc. 11th ISCA Speech Synthesis Workshop (SSW 11), 189-194, doi: 10.21437/SSW.2021-33

@inproceedings{yufune21_ssw,
  author={Kazuya Yufune and Tomoki Koriyama and Shinnosuke Takamichi and Hiroshi Saruwatari},
  title={{Accent Modeling of Low-Resourced Dialect in Pitch Accent Language Using Variational Autoencoder}},
  year=2021,
  booktitle={Proc. 11th ISCA Speech Synthesis Workshop (SSW 11)},
  pages={189--194},
  doi={10.21437/SSW.2021-33}
}