Deep Neural Network Based Acoustic-to-Articulatory Inversion Using Phone Sequence Information

Xurong Xie, Xunying Liu, Lan Wang


In recent years, neural network based acoustic-to-articulatory inversion approaches have achieved state-of-the-art performance. One major issue associated with these approaches is the lack of phone sequence information during inversion. To address this issue, this paper proposes an improved architecture that hierarchically concatenates phone classification and articulatory inversion component DNNs to improve articulatory movement generation. On a Mandarin Chinese speech inversion task, the proposed technique consistently outperformed a range of baseline DNN and RNN inversion systems constructed using no phone sequence information, a mixture density parameter output layer, additional phone features at the input layer, or multi-task learning with additional monophone output layer target labels, measured in terms of electromagnetic articulography (EMA) root mean square error (RMSE) and correlation. Further improvements were obtained by using the bottleneck features extracted from the proposed hierarchical articulatory inversion systems as auxiliary features in generalized variable parameter HMM (GVP-HMM) based inversion systems.
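The hierarchical concatenation described in the abstract can be sketched as follows: a phone classification component DNN maps acoustic features to phone posteriors, which are then appended to the acoustic input of the articulatory inversion component DNN, and the predicted EMA trajectories are scored with RMSE and correlation. This is a minimal illustrative sketch only; the layer sizes, random weights, and all variable names are assumptions and do not reflect the paper's actual configuration or training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b, act=np.tanh):
    """A single fully connected layer with an activation."""
    return act(x @ w + b)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Illustrative dimensions (assumptions, not taken from the paper)
N_FRAMES, N_ACOUSTIC, N_PHONES, N_EMA = 5, 39, 40, 12

acoustic = rng.standard_normal((N_FRAMES, N_ACOUSTIC))

# 1) Phone classification component DNN: acoustics -> phone posteriors
w1, b1 = rng.standard_normal((N_ACOUSTIC, 64)), np.zeros(64)
w2, b2 = rng.standard_normal((64, N_PHONES)), np.zeros(N_PHONES)
phone_post = softmax(dense(acoustic, w1, b1) @ w2 + b2)

# 2) Hierarchical concatenation: phone posteriors appended to acoustics
inv_input = np.concatenate([acoustic, phone_post], axis=1)

# 3) Articulatory inversion component DNN: concatenated input -> EMA
w3, b3 = rng.standard_normal((N_ACOUSTIC + N_PHONES, 64)), np.zeros(64)
w4, b4 = rng.standard_normal((64, N_EMA)), np.zeros(N_EMA)
ema_pred = dense(inv_input, w3, b3) @ w4 + b4  # linear output layer

# Evaluation metrics named in the abstract: EMA RMSE and correlation
ema_ref = rng.standard_normal((N_FRAMES, N_EMA))  # placeholder reference
rmse = np.sqrt(np.mean((ema_pred - ema_ref) ** 2))
corr = np.corrcoef(ema_pred.ravel(), ema_ref.ravel())[0, 1]
```

In the paper's actual systems the component DNNs are trained (and the phone classifier supervised with phone labels) rather than randomly initialized; the sketch only shows how the two components connect.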


DOI: 10.21437/Interspeech.2016-659

Cite as

Xie, X., Liu, X., Wang, L. (2016) Deep Neural Network Based Acoustic-to-Articulatory Inversion Using Phone Sequence Information. Proc. Interspeech 2016, 1497-1501.

Bibtex
@inproceedings{Xie+2016,
  author={Xurong Xie and Xunying Liu and Lan Wang},
  title={Deep Neural Network Based Acoustic-to-Articulatory Inversion Using Phone Sequence Information},
  year={2016},
  booktitle={Interspeech 2016},
  doi={10.21437/Interspeech.2016-659},
  url={http://dx.doi.org/10.21437/Interspeech.2016-659},
  pages={1497--1501}
}