Statistical Approach to Speech Synthesis: Past, Present and Future

Keiichi Tokuda

The basic problem of statistical speech synthesis is quite simple: we have a speech database for training, i.e., a set of speech waveforms and corresponding texts; given a text not included in the training data, what is the speech waveform corresponding to that text? The whole text-to-speech generation process is decomposed into feasible subproblems — usually text analysis, acoustic modeling, and waveform generation — combined as a statistical generative model. Each submodule can be modeled by a statistical machine learning technique: first, hidden Markov models were applied to the acoustic modeling module, and then various types of deep neural networks (DNNs) were applied not only to the acoustic modeling module but also to the other modules. I will give an overview of such statistical approaches to speech synthesis, looking back on their evolution over the last couple of decades. Recent DNN-based approaches have drastically improved speech quality, causing a paradigm shift from the concatenative approach to the generative model-based statistical approach. However, for realizing human-like talking machines, the goal is not only to generate natural-sounding speech but also to flexibly control variations in speech, such as speaker identities, speaking styles, and emotional expressions. This talk will also discuss such future challenges and directions in speech synthesis research.
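The decomposition into subproblems combined as a single generative model can be sketched as a probabilistic formulation; this is a hedged illustration in my own notation (the symbols $\hat{\mathbf{x}}$ for the synthesized speech, $w$ for the input text, and $l$ for intermediate linguistic features are assumptions, not taken from the abstract):

\begin{align*}
\hat{\mathbf{x}}
  &= \operatorname*{arg\,max}_{\mathbf{x}} \; p(\mathbf{x} \mid w) \\
  &= \operatorname*{arg\,max}_{\mathbf{x}} \; \sum_{l} \underbrace{p(\mathbf{x} \mid l)}_{\text{acoustic model \& waveform generation}} \; \underbrace{p(l \mid w)}_{\text{text analysis}}
\end{align*}

Under this reading, each factor is one of the submodules named above, and each can be realized by a statistical model — an HMM or a DNN for $p(\mathbf{x} \mid l)$, for example.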

Cite as: Tokuda, K. (2019) Statistical Approach to Speech Synthesis: Past, Present and Future. Proc. Interspeech 2019.

@inproceedings{tokuda19_interspeech,
  author={Keiichi Tokuda},
  title={{Statistical Approach to Speech Synthesis: Past, Present and Future}},
  booktitle={Proc. Interspeech 2019},
  year={2019}
}