Neural iTTS: Toward Synthesizing Speech in Real-time with End-to-end Neural Text-to-Speech Framework

Tomoya Yanagita, Sakriani Sakti, Satoshi Nakamura


Real-time machine speech interpreters aim to mimic human interpreters, who are able to produce high-quality speech translations on the fly. This requires all system components, including speech recognition, machine translation, and text-to-speech (TTS), to operate incrementally, before the speaker has finished an entire sentence. For TTS this poses a problem, as a standard framework commonly requires the language-dependent contextual linguistic features of a full sentence to produce a natural-sounding speech waveform. Existing studies of incremental TTS (iTTS) have mainly been conducted on models based on hidden Markov models (HMMs). Recently, end-to-end neural TTS has been shown to synthesize more natural speech than HMM-based systems. In this paper, we take an initial step toward constructing iTTS on an end-to-end neural framework (Neural iTTS) and investigate the effects of various incremental units on the quality of end-to-end neural speech synthesis in both English and Japanese.
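To illustrate the idea of incremental units, here is a minimal sketch (not from the paper) of how an iTTS front-end might segment incoming text into fixed-size word chunks and hand each chunk to a synthesizer before the full sentence is available. The chunk size and the `synthesize()` stub are hypothetical placeholders for whatever unit and neural TTS model a real system would use.

```python
def incremental_units(text, words_per_unit):
    """Split text into consecutive chunks of words_per_unit words each."""
    words = text.split()
    return [" ".join(words[i:i + words_per_unit])
            for i in range(0, len(words), words_per_unit)]

def synthesize(unit):
    # Placeholder: a real system would run a neural TTS model here and
    # return a waveform segment for this incremental unit.
    return f"<wav:{unit}>"

sentence = "this is an example of incremental speech synthesis"
for unit in incremental_units(sentence, 2):
    print(synthesize(unit))
```

Smaller units reduce latency but give the synthesizer less linguistic context per chunk, which is exactly the quality trade-off the paper investigates.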


DOI: 10.21437/SSW.2019-33

Cite as: Yanagita, T., Sakti, S., Nakamura, S. (2019) Neural iTTS: Toward Synthesizing Speech in Real-time with End-to-end Neural Text-to-Speech Framework. Proc. 10th ISCA Speech Synthesis Workshop, 183-188, DOI: 10.21437/SSW.2019-33.


@inproceedings{Yanagita2019,
  author={Tomoya Yanagita and Sakriani Sakti and Satoshi Nakamura},
  title={{Neural iTTS: Toward Synthesizing Speech in Real-time with End-to-end Neural Text-to-Speech Framework}},
  year=2019,
  booktitle={Proc. 10th ISCA Speech Synthesis Workshop},
  pages={183--188},
  doi={10.21437/SSW.2019-33},
  url={http://dx.doi.org/10.21437/SSW.2019-33}
}