An HMM/DNN Comparison for Synchronized Text-to-Speech and Tongue Motion Synthesis

Sébastien Le Maguer, Ingmar Steiner, Alexander Hewer


We present an end-to-end text-to-speech (TTS) synthesis system that generates audio and synchronized tongue motion directly from text. This is achieved by adapting a statistical shape space model of the tongue surface to an articulatory speech corpus and training a speech synthesis system directly on the tongue model parameter weights. We focus our analysis on the application of two standard methodologies, based on Hidden Markov Models (HMMs) and Deep Neural Networks (DNNs), respectively, to train both acoustic models and the tongue model parameter weights. We evaluate both methodologies at every step by comparing the predicted articulatory movements against the reference data. The results show that even with less than 2h of data, DNNs already outperform HMMs.
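To illustrate the DNN side of the comparison, the sketch below trains a one-hidden-layer regression network that maps input feature vectors to tongue model parameter weights. This is a minimal, hypothetical stand-in using NumPy and synthetic data; the dimensions, the plain gradient-descent training loop, and the synthetic targets are assumptions for illustration, not the paper's actual architecture or corpus.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: input feature vectors -> tongue model weights.
n_in, n_hidden, n_out, n_samples = 20, 32, 6, 200

# Synthetic stand-in data (the paper instead uses features derived from
# text and parameter weights of a statistical tongue shape model).
X = rng.normal(size=(n_samples, n_in))
true_W = rng.normal(size=(n_in, n_out))
Y = np.tanh(X @ true_W) + 0.01 * rng.normal(size=(n_samples, n_out))

# One-hidden-layer network trained with plain gradient descent on MSE.
W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_out))
b2 = np.zeros(n_out)

lr = 0.05
losses = []
for epoch in range(500):
    H = np.tanh(X @ W1 + b1)   # hidden activations
    pred = H @ W2 + b2         # predicted parameter weights
    err = pred - Y
    losses.append(np.mean(err ** 2))
    # Backpropagation through the two layers.
    dW2 = H.T @ err / n_samples
    db2 = err.mean(axis=0)
    dH = (err @ W2.T) * (1 - H ** 2)
    dW1 = X.T @ dH / n_samples
    db1 = dH.mean(axis=0)
    W2 -= lr * dW2
    b2 -= lr * db2
    W1 -= lr * dW1
    b1 -= lr * db1
```

Evaluating such a model against held-out reference trajectories, as the paper does for articulatory movements, amounts to comparing `pred` with the measured parameter weights on unseen utterances.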


DOI: 10.21437/Interspeech.2017-936

Cite as: Le Maguer, S., Steiner, I., Hewer, A. (2017) An HMM/DNN Comparison for Synchronized Text-to-Speech and Tongue Motion Synthesis. Proc. Interspeech 2017, 239-243, DOI: 10.21437/Interspeech.2017-936.


@inproceedings{LeMaguer2017,
  author={Sébastien Le Maguer and Ingmar Steiner and Alexander Hewer},
  title={An HMM/DNN Comparison for Synchronized Text-to-Speech and Tongue Motion Synthesis},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={239--243},
  doi={10.21437/Interspeech.2017-936},
  url={http://dx.doi.org/10.21437/Interspeech.2017-936}
}