This paper proposes a model of speaker-normalized acoustic-to-articulatory mapping based on statistical voice conversion. A mapping function from acoustic to articulatory parameters is usually trained on parallel data from a single speaker. The resulting model therefore works appropriately only for that specific speaker, and applying it to other speakers degrades the performance of acoustic-to-articulatory mapping. In this paper, two models, one for speaker conversion and one for acoustic-to-articulatory mapping, are implemented with Gaussian mixture models (GMMs), and by integrating these two models we propose two methods of speaker-normalized acoustic-to-articulatory mapping. The first concatenates the two models sequentially; the second integrates them into a unified model, in which the acoustic parameters of one speaker are converted directly to the articulatory parameters of another speaker. Experiments show that both methods improve the mapping accuracy and that the unified method outperforms the sequential one; for velar stop consonants in particular, its mapping accuracy is higher by 0.6 mm.
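The GMM-based mapping the abstract refers to is conventionally realized as minimum mean-square-error (MMSE) regression under a joint GMM: a mixture is trained on concatenated source/target feature vectors, and a target estimate is obtained as the posterior-weighted sum of per-component conditional means. The sketch below illustrates that generic technique using scikit-learn and SciPy; it is a minimal illustration on synthetic data, not the paper's implementation (the function names, the component count, and the toy acoustic/articulatory dimensions are all assumptions).

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(X, Y, n_components=4, seed=0):
    """Fit a full-covariance GMM on joint [source, target] vectors."""
    Z = np.hstack([X, Y])
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="full", random_state=seed)
    gmm.fit(Z)
    return gmm

def gmm_map(gmm, X, dx):
    """MMSE mapping E[y | x]: posterior-weighted conditional means."""
    means_x = gmm.means_[:, :dx]            # per-component source means
    means_y = gmm.means_[:, dx:]            # per-component target means
    cov_xx = gmm.covariances_[:, :dx, :dx]  # source-source covariances
    cov_yx = gmm.covariances_[:, dx:, :dx]  # cross covariances
    # Component likelihoods p(x | m), then responsibilities p(m | x).
    lik = np.stack([multivariate_normal.pdf(X, means_x[m], cov_xx[m])
                    for m in range(gmm.n_components)], axis=1)
    resp = gmm.weights_ * lik + 1e-300      # floor avoids 0/0 rows
    resp /= resp.sum(axis=1, keepdims=True)
    # E[y | x, m] = mu_y + Sigma_yx Sigma_xx^{-1} (x - mu_x), then average.
    Y_hat = np.zeros((X.shape[0], means_y.shape[1]))
    for m in range(gmm.n_components):
        A = cov_yx[m] @ np.linalg.inv(cov_xx[m])
        Y_hat += resp[:, [m]] * (means_y[m] + (X - means_x[m]) @ A.T)
    return Y_hat

# Toy parallel data: a linear source-to-target relation plus noise
# (stand-ins for acoustic and articulatory features).
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
W = np.array([[1.0, 2.0], [0.5, -1.0], [0.3, 0.3]])
Y = X @ W.T + 0.05 * rng.normal(size=(400, 3))
gmm = fit_joint_gmm(X, Y)
Y_hat = gmm_map(gmm, X, dx=2)
```

The same mechanics apply whether the joint vector pairs acoustic with articulatory features (the mapping model) or one speaker's features with another's (the conversion model); the paper's unified method amounts to training a single such joint model across both roles.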
Bibliographic reference. Uchida, Hidetsugu / Saito, Daisuke / Minematsu, Nobuaki / Hirose, Keikichi (2015): "Statistical acoustic-to-articulatory mapping unified with speaker normalization based on voice conversion", in Proc. INTERSPEECH 2015, 588-592.