We investigate the impact of input linguistic feature representation on Japanese end-to-end speech synthesis. End-to-end speech synthesis systems, which generate natural speech directly from text, have recently been proposed. The English end-to-end system Tacotron 2 achieves sound quality close to that of natural speech. However, end-to-end speech synthesis is harder to achieve for non-alphabetic languages such as Japanese and Chinese, which use ideograms and rely on pitch accent and tone, respectively, than for alphabetic languages that use stress accent, such as English and Spanish. We investigate the units of the input sequence, contexts, pause insertion, vowel devoicing, and the pronunciation of particles for Japanese end-to-end speech synthesis. Experimental results indicate that the naturalness of the synthesized speech improves when high or low accent information is used. The results also indicate that accent-phrase information can help to predict pause insertion, and that an end-to-end text-to-speech model may be able to change the pronunciation of devoiced vowels and particles.
Cite as: Fujimoto, T., Hashimoto, K., Oura, K., Nankaku, Y., Tokuda, K. (2019) Impacts of input linguistic feature representation on Japanese end-to-end speech synthesis. Proc. 10th ISCA Workshop on Speech Synthesis (SSW 10), 166-171, doi: 10.21437/SSW.2019-30
@inproceedings{fujimoto19_ssw,
  author={Takato Fujimoto and Kei Hashimoto and Keiichiro Oura and Yoshihiko Nankaku and Keiichi Tokuda},
  title={{Impacts of input linguistic feature representation on Japanese end-to-end speech synthesis}},
  year=2019,
  booktitle={Proc. 10th ISCA Workshop on Speech Synthesis (SSW 10)},
  pages={166--171},
  doi={10.21437/SSW.2019-30}
}