Rakugo speech synthesis using segment-to-segment neural transduction and style tokens — toward speech synthesis for entertaining audiences

Shuhei Kato, Yusuke Yasuda, Xin Wang, Erica Cooper, Shinji Takaki, Junichi Yamagishi


We have been working on constructing rakugo speech synthesis as a challenging example of speech synthesis that entertains audiences. Rakugo is a traditional Japanese form of verbal entertainment similar to one-person stand-up comedy. In rakugo, a single performer plays multiple characters, and their conversations drive the story forward. We tried to build a rakugo synthesizer with state-of-the-art attention-based encoder-decoder models such as Tacotron 2. However, this did not work well because the expressions in rakugo speech are far more diverse than those in read speech. We therefore use segment-to-segment neural transduction (SSNT) in place of the combination of attention and decoder. Furthermore, we experimented with global style tokens (GST) and manually labeled context features to enrich the speaking styles of synthesized rakugo speech. The results show that SSNT greatly helps align the encoder and decoder time steps and that GST helps reproduce speaking characteristics better.
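To make the GST idea concrete, here is a minimal sketch of the style-token attention step, assuming the usual GST formulation: a reference embedding attends over a small bank of learned style tokens, and the softmax-weighted sum of the tokens is the style embedding that conditions the decoder. The token count and dimensions below are illustrative, not the settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

K, d = 10, 256                         # number of style tokens, embedding size (illustrative)
tokens = rng.standard_normal((K, d))   # learned style tokens (random stand-ins here)
r = rng.standard_normal(d)             # reference embedding from a reference encoder

scores = tokens @ r / np.sqrt(d)       # scaled dot-product attention scores
weights = np.exp(scores - scores.max())
weights /= weights.sum()               # softmax over the K tokens
style_embedding = weights @ tokens     # (d,) style embedding conditioning the decoder

print(weights.shape, style_embedding.shape)
```

At inference time, the token weights can also be set by hand to steer the speaking style, which is one reason GSTs are attractive for expressive speech such as rakugo.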


DOI: 10.21437/SSW.2019-20

Cite as: Kato, S., Yasuda, Y., Wang, X., Cooper, E., Takaki, S., Yamagishi, J. (2019) Rakugo speech synthesis using segment-to-segment neural transduction and style tokens — toward speech synthesis for entertaining audiences. Proc. 10th ISCA Speech Synthesis Workshop, 111-116, DOI: 10.21437/SSW.2019-20.


@inproceedings{Kato2019,
  author={Shuhei Kato and Yusuke Yasuda and Xin Wang and Erica Cooper and Shinji Takaki and Junichi Yamagishi},
  title={{Rakugo speech synthesis using segment-to-segment neural transduction and style tokens — toward speech synthesis for entertaining audiences}},
  year=2019,
  booktitle={Proc. 10th ISCA Speech Synthesis Workshop},
  pages={111--116},
  doi={10.21437/SSW.2019-20},
  url={http://dx.doi.org/10.21437/SSW.2019-20}
}