In this paper we present our initial results on articulatory-to-acoustic conversion based on tongue movement recordings using Deep Neural Networks (DNNs). Although deep learning has revolutionized several fields, so far only a few researchers have applied DNNs to this task. Here, we compare various possible feature representation approaches combined with DNN-based regression. As input, we recorded synchronized 2D ultrasound images and speech signals. The task of the DNN was to estimate Mel-Generalized Cepstrum-based Line Spectral Pair (MGC-LSP) coefficients, which then served as input to a standard pulse-noise vocoder for speech synthesis. As the raw ultrasound images have a relatively high resolution, we experimented with various feature selection and transformation approaches to reduce the size of the feature vectors. The synthetic speech signals resulting from the various DNN configurations were evaluated using both objective measures and a subjective listening test. We found that the representation combining several neighboring image frames with a feature selection method was preferred both by the subjects of the listening test and in terms of the Normalized Mean Squared Error. Our results may be useful for creating Silent Speech Interface applications in the future.
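To make the pipeline described above concrete, the sketch below shows one possible realization: neighboring ultrasound frames are stacked into a single feature vector, the dimensionality is reduced, and a feed-forward DNN predicts MGC-LSP coefficients frame by frame. This is a minimal sketch, not the authors' implementation; the data sizes, the use of PCA as a stand-in for the feature selection and transformation methods compared in the paper, and the network topology are all illustrative assumptions, and the predicted coefficients would then drive a pulse-noise vocoder for resynthesis.

```python
# Illustrative sketch (not the authors' code): DNN-based regression from
# stacked ultrasound frames to MGC-LSP targets. All sizes, the PCA step,
# and the network topology are assumptions made for this example.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

n_frames, img_h, img_w = 500, 64, 64   # synthetic stand-in for ultrasound data
n_mgc_lsp = 25                         # MGC-LSP coefficients per frame
context = 2                            # neighboring frames on each side

images = rng.random((n_frames, img_h * img_w)).astype(np.float32)
targets = rng.standard_normal((n_frames, n_mgc_lsp)).astype(np.float32)

def stack_context(frames, context):
    """Concatenate each frame with its +/- `context` neighbors."""
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(frames)] for i in range(2 * context + 1)])

X = stack_context(images, context)

# Reduce the high-dimensional stacked image features.
X_reduced = PCA(n_components=128).fit_transform(X)

# Feed-forward DNN regressor estimating MGC-LSP coefficients frame by frame;
# the predicted coefficients would be fed to a pulse-noise vocoder.
dnn = MLPRegressor(hidden_layer_sizes=(1000, 1000), max_iter=50)
dnn.fit(X_reduced, targets)
mgc_lsp_pred = dnn.predict(X_reduced)
```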
Cite as: Csapó, T.G., Grósz, T., Gosztolya, G., Tóth, L., Markó, A. (2017) DNN-Based Ultrasound-to-Speech Conversion for a Silent Speech Interface. Proc. Interspeech 2017, 3672-3676, doi: 10.21437/Interspeech.2017-939
@inproceedings{csapo17_interspeech,
  author={Tamás Gábor Csapó and Tamás Grósz and Gábor Gosztolya and László Tóth and Alexandra Markó},
  title={{DNN-Based Ultrasound-to-Speech Conversion for a Silent Speech Interface}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={3672--3676},
  doi={10.21437/Interspeech.2017-939}
}