Multi-Task Learning of Speech Recognition and Speech Synthesis Parameters for Ultrasound-based Silent Speech Interfaces

László Tóth, Gábor Gosztolya, Tamás Grósz, Alexandra Markó, Tamás Gábor Csapó


Silent Speech Interface systems apply two different strategies to solve the articulatory-to-acoustic conversion task. The recognition-and-synthesis approach applies speech recognition techniques to map the articulatory data to a textual transcript, which is then converted to speech by a conventional text-to-speech system. The direct synthesis approach seeks to convert the articulatory information directly into speech synthesis (vocoder) parameters. In both cases, deep neural networks are an obvious and popular choice for learning the mapping. Recognizing that learning speech recognition targets and learning speech synthesis targets (acoustic model states vs. vocoder parameters) are two closely related tasks over the same ultrasound tongue image input, here we experiment with multi-task training of deep neural networks, which seeks to solve the two tasks simultaneously. Our results show that the parallel learning of the two types of targets is indeed beneficial for both tasks. Moreover, we obtained further improvements by using multi-task training as a weight-initialization step before task-specific training. Overall, we report a relative error rate reduction of about 7% in both the speech recognition and the speech synthesis tasks.
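The multi-task setup described in the abstract (a shared network feeding two task heads, one classification head for recognition targets and one regression head for vocoder parameters, trained on a weighted sum of the two losses) can be sketched on a toy problem. Everything below is illustrative, not the paper's actual system: the real model is a deep network over ultrasound tongue images, whereas here single weights stand in for the shared layers and the two heads, and the data, the loss weighting `alpha`, and the learning rate are invented for the demonstration.

```python
import math
import random

random.seed(0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: a 1-D "articulatory" input x, a binary "phone class" target
# (recognition task) and a scalar "vocoder parameter" target (synthesis task).
data = [(x, 1.0 if x > 0 else 0.0, 0.5 * x)
        for x in (random.uniform(-1, 1) for _ in range(200))]

# One shared weight plays the role of the shared hidden layers; each task
# head is a single weight on top of the shared representation.
w_shared, w_cls, w_reg = 0.1, 0.1, 0.1
alpha, lr = 0.5, 0.1  # task weighting of the combined loss, learning rate

for epoch in range(200):
    for x, y_cls, y_reg in data:
        h = math.tanh(w_shared * x)      # shared representation
        p = sigmoid(w_cls * h)           # recognition head (class posterior)
        v = w_reg * h                    # synthesis head (regression output)
        # Gradients of the combined loss
        # alpha * cross-entropy + (1 - alpha) * squared error:
        # the shared weight receives gradient from BOTH tasks.
        g_shared = (alpha * (p - y_cls) * w_cls
                    + (1 - alpha) * 2.0 * (v - y_reg) * w_reg) * (1 - h * h) * x
        w_cls -= lr * alpha * (p - y_cls) * h
        w_reg -= lr * (1 - alpha) * 2.0 * (v - y_reg) * h
        w_shared -= lr * g_shared

# Evaluate both heads on the shared representation.
acc = sum((sigmoid(w_cls * math.tanh(w_shared * x)) > 0.5) == (y > 0.5)
          for x, y, _ in data) / len(data)
mse = sum((w_reg * math.tanh(w_shared * x) - y) ** 2
          for x, _, y in data) / len(data)
print(f"recognition accuracy={acc:.2f}, synthesis MSE={mse:.4f}")
```

The key design point, mirroring the paper's idea, is that `w_shared` is updated by the gradients of both losses, so the shared representation must serve recognition and synthesis at once; the paper's weight-initialization variant would correspond to running this joint loop first and then continuing with only one task's gradient.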


 DOI: 10.21437/Interspeech.2018-1078

Cite as: Tóth, L., Gosztolya, G., Grósz, T., Markó, A., Csapó, T.G. (2018) Multi-Task Learning of Speech Recognition and Speech Synthesis Parameters for Ultrasound-based Silent Speech Interfaces. Proc. Interspeech 2018, 3172-3176, DOI: 10.21437/Interspeech.2018-1078.


@inproceedings{Toth2018,
  author={László Tóth and Gábor Gosztolya and Tamás Grósz and Alexandra Markó and Tamás Gábor Csapó},
  title={Multi-Task Learning of Speech Recognition and Speech Synthesis Parameters for Ultrasound-based Silent Speech Interfaces},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={3172--3176},
  doi={10.21437/Interspeech.2018-1078},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1078}
}