Articulation-to-Speech Synthesis Using Articulatory Flesh Point Sensors’ Orientation Information

Beiming Cao, Myungjong Kim, Jun R. Wang, Jan van Santen, Ted Mau, Jun Wang


Articulation-to-speech (ATS) synthesis generates an audio waveform directly from articulatory information. Existing work on ATS has used only articulatory movement information (spatial coordinates). The orientation information of articulatory flesh points has rarely been used, although some devices (e.g., electromagnetic articulography) provide it. Previous work indicated that orientation carries significant information about speech production. In this paper, we explored how using the orientation information of flesh points on the articulators (i.e., tongue, lips, and jaw) affects ATS performance. Experiments using the articulators' movement information with and without orientation information were conducted using standard deep neural networks (DNNs) and long short-term memory recurrent neural networks (LSTM-RNNs). Both objective and subjective evaluations indicated that adding the orientation information of flesh points to the movement information produced higher-quality speech output than using movement information alone.
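
The two input conditions compared in the abstract (movement only versus movement plus orientation) can be pictured as a per-frame feature concatenation. The sketch below is only an illustration under stated assumptions: the function name `build_ats_input`, the number of sensors (4), the orientation dimensionality (2 values per sensor), and the random data are all hypothetical and are not taken from the paper.

```python
import numpy as np

def build_ats_input(positions, orientations=None):
    """Flatten per-sensor features into one input vector per frame.

    positions:    array of shape (frames, sensors, 3), spatial coordinates.
    orientations: optional array of shape (frames, sensors, k), orientation
                  values per sensor (k is an assumption, not from the paper).
    """
    frames = positions.shape[0]
    # Movement-only condition: one row of x, y, z values per frame.
    features = positions.reshape(frames, -1)
    if orientations is not None:
        if orientations.shape[0] != frames:
            raise ValueError("positions and orientations need the same frame count")
        # Movement + orientation condition: append orientation values.
        features = np.concatenate(
            [features, orientations.reshape(frames, -1)], axis=1
        )
    return features

# Hypothetical data: 100 frames, 4 sensors, 3 coordinates and 2 orientation values each.
rng = np.random.default_rng(0)
pos = rng.normal(size=(100, 4, 3))
ori = rng.normal(size=(100, 4, 2))

movement_only = build_ats_input(pos)
with_orientation = build_ats_input(pos, ori)
print(movement_only.shape, with_orientation.shape)
```

Either matrix could then serve as the input to a frame-level regression model such as a DNN or LSTM-RNN; the model and the acoustic target features are outside the scope of this sketch.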


DOI: 10.21437/Interspeech.2018-2484

Cite as: Cao, B., Kim, M., Wang, J.R., van Santen, J., Mau, T., Wang, J. (2018) Articulation-to-Speech Synthesis Using Articulatory Flesh Point Sensors’ Orientation Information. Proc. Interspeech 2018, 3152-3156, DOI: 10.21437/Interspeech.2018-2484.


@inproceedings{Cao2018,
  author={Beiming Cao and Myungjong Kim and Jun R. Wang and Jan {van Santen} and Ted Mau and Jun Wang},
  title={Articulation-to-Speech Synthesis Using Articulatory Flesh Point Sensors’ Orientation Information},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={3152--3156},
  doi={10.21437/Interspeech.2018-2484},
  url={http://dx.doi.org/10.21437/Interspeech.2018-2484}
}