ISCA Archive SSW 2021

Extending Text-to-Speech Synthesis with Articulatory Movement Prediction using Ultrasound Tongue Imaging

Tamás Gábor Csapó

In this paper, we present our first experiments in text-to-articulation prediction, using ultrasound tongue images as targets. We extend a traditional (vocoder-based) DNN-TTS framework to also predict PCA-compressed ultrasound images, from which the continuous tongue motion can be reconstructed in synchrony with the synthesized speech. We use data from eight speakers, train fully connected and recurrent neural networks, and show that, when training data is limited, FC-DNNs are more suitable than LSTMs for predicting this sequential data. Objective experiments and visualized predictions show that the proposed solution is feasible and that the generated ultrasound videos are close to natural tongue movement. Articulatory movement prediction from text input can be useful for audiovisual speech synthesis or computer-assisted pronunciation training.
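
As a rough illustration of the PCA compression and reconstruction step mentioned in the abstract, the following minimal sketch fits PCA on flattened image frames, compresses them to a few coefficients per frame, and maps the coefficients back to image space. Everything specific here is an assumption for illustration only: the frame size (64x32), the number of components (5), the synthetic random data standing in for ultrasound recordings, and the helper names `fit_pca`, `compress`, and `reconstruct`. The abstract does not state the settings or implementation actually used in the paper.

```python
import numpy as np

def fit_pca(frames, n_components):
    """Fit PCA on flattened frames of shape (n_frames, n_pixels) via SVD."""
    mean = frames.mean(axis=0)
    centered = frames - mean
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def compress(frames, mean, components):
    """Project frames onto the principal directions (one coefficient vector per frame)."""
    return (frames - mean) @ components.T

def reconstruct(coeffs, mean, components):
    """Map PCA coefficients back to pixel space."""
    return coeffs @ components + mean

# Synthetic stand-in for ultrasound data (hypothetical sizes, not from the paper):
# 200 frames of 64x32 pixels generated from 5 latent factors plus small noise.
rng = np.random.default_rng(0)
height, width, n_frames, n_components = 64, 32, 200, 5
latent = rng.normal(size=(n_frames, n_components))
basis = rng.normal(size=(n_components, height * width))
frames = latent @ basis + 0.01 * rng.normal(size=(n_frames, height * width))

mean, components = fit_pca(frames, n_components)
coeffs = compress(frames, mean, components)      # compact per-frame targets
recon = reconstruct(coeffs, mean, components)    # back to pixel space
video = recon.reshape(n_frames, height, width)   # frame sequence for playback
mse = float(np.mean((recon - frames) ** 2))
```

In a text-to-articulation setting such as the one described, a network would presumably predict the per-frame coefficient vectors (here `coeffs`) from text-derived features, and `reconstruct` would then turn the predictions into an image sequence aligned with the synthesized speech.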


doi: 10.21437/SSW.2021-2

Cite as: Csapó, T.G. (2021) Extending Text-to-Speech Synthesis with Articulatory Movement Prediction using Ultrasound Tongue Imaging. Proc. 11th ISCA Speech Synthesis Workshop (SSW 11), 7-12, doi: 10.21437/SSW.2021-2

@inproceedings{csapo21_ssw,
  author={Tamás Gábor Csapó},
  title={{Extending Text-to-Speech Synthesis with Articulatory Movement Prediction using Ultrasound Tongue Imaging}},
  year=2021,
  booktitle={Proc. 11th ISCA Speech Synthesis Workshop (SSW 11)},
  pages={7--12},
  doi={10.21437/SSW.2021-2}
}