ISCA Archive Interspeech 2017

An HMM/DNN Comparison for Synchronized Text-to-Speech and Tongue Motion Synthesis

Sébastien Le Maguer, Ingmar Steiner, Alexander Hewer

We present an end-to-end text-to-speech (TTS) synthesis system that generates audio and synchronized tongue motion directly from text. This is achieved by adapting a statistical shape space model of the tongue surface to an articulatory speech corpus and training a speech synthesis system directly on the tongue model parameter weights. We focus our analysis on the application of two standard methodologies, based on Hidden Markov Models (HMMs) and Deep Neural Networks (DNNs), respectively, to train both acoustic models and the tongue model parameter weights. We evaluate both methodologies at every step by comparing the predicted articulatory movements against the reference data. The results show that even with less than 2h of data, DNNs already outperform HMMs.
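The paper itself does not include code, but the DNN variant described above amounts to a regression from frame-level linguistic features to acoustic parameters concatenated with tongue model parameter weights. The following is a minimal, hypothetical sketch of such a model in PyTorch; all dimensions, layer sizes, and names (LINGUISTIC_DIM, ACOUSTIC_DIM, TONGUE_WEIGHT_DIM, AcousticArticulatoryDNN) are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn

# Assumed dimensions (not taken from the paper): per-frame linguistic
# context features in, acoustic parameters plus tongue shape-space
# weights out.
LINGUISTIC_DIM = 400      # hypothetical linguistic feature vector size
ACOUSTIC_DIM = 187        # hypothetical acoustic parameter vector size
TONGUE_WEIGHT_DIM = 10    # hypothetical number of tongue model weights

class AcousticArticulatoryDNN(nn.Module):
    """Feed-forward DNN mapping frame-level linguistic features to
    concatenated acoustic parameters and tongue model weights."""
    def __init__(self, hidden=1024, layers=3):
        super().__init__()
        dims = [LINGUISTIC_DIM] + [hidden] * layers
        blocks = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            blocks += [nn.Linear(d_in, d_out), nn.Tanh()]
        self.trunk = nn.Sequential(*blocks)
        self.head = nn.Linear(hidden, ACOUSTIC_DIM + TONGUE_WEIGHT_DIM)

    def forward(self, x):
        return self.head(self.trunk(x))

# One training step on random stand-in data, for illustration only.
model = AcousticArticulatoryDNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

features = torch.randn(32, LINGUISTIC_DIM)                       # mini-batch of frames
targets = torch.randn(32, ACOUSTIC_DIM + TONGUE_WEIGHT_DIM)      # joint acoustic + articulatory targets

loss = loss_fn(model(features), targets)
loss.backward()
optimizer.step()
```

Training the acoustic and articulatory streams jointly in one output layer, as sketched here, is only one plausible arrangement; separate output streams or separate models per parameter type would fit the same evaluation setup.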


doi: 10.21437/Interspeech.2017-936

Cite as: Le Maguer, S., Steiner, I., Hewer, A. (2017) An HMM/DNN Comparison for Synchronized Text-to-Speech and Tongue Motion Synthesis. Proc. Interspeech 2017, 239-243, doi: 10.21437/Interspeech.2017-936

@inproceedings{maguer17_interspeech,
  author={Sébastien Le Maguer and Ingmar Steiner and Alexander Hewer},
  title={{An HMM/DNN Comparison for Synchronized Text-to-Speech and Tongue Motion Synthesis}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={239--243},
  doi={10.21437/Interspeech.2017-936}
}