We present a strategy for learning Deep-Neural-Network (DNN)-based Acoustic-to-Articulatory Mapping (AAM) functions where the contribution of an articulatory feature (AF) to the global reconstruction error is weighted by its relevance. We first empirically show that when an articulator is more crucial for the production of a given phone, it is less variable, confirming previous findings. We then compute the relevance of an articulatory feature as a function of its frame-wise variance given the acoustic evidence, which is estimated through a Mixture Density Network (MDN). Finally, we combine acoustic and recovered articulatory features in a hybrid DNN-HMM phone recognizer. Tested on the MOCHA-TIMIT corpus, articulatory features reconstructed by a standardly trained DNN lead to an 8.4% relative phone error reduction (w.r.t. a recognizer that only uses MFCCs), whereas when the articulatory features are reconstructed taking their relevance into account, the relative phone error reduction increases to 10.9%.
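The core idea of the abstract, weighting each articulatory feature's reconstruction error by a relevance derived from its MDN-estimated frame-wise variance, can be sketched as below. This is a minimal illustration, not the paper's exact formulation: the inverse-variance weighting and the per-frame normalization are assumptions, since the abstract only states that relevance is a function of the variance (lower variance, i.e. a more critical articulator, yielding higher relevance).

```python
import numpy as np

def relevance_weights(variances, eps=1e-8):
    """Map MDN-estimated frame-wise variances to relevance weights.

    variances: array of shape (n_frames, n_features).
    Lower variance -> articulator more critical for the phone -> higher
    relevance. Inverse-variance weighting is one plausible choice (an
    assumption; the paper does not specify this exact form).
    """
    w = 1.0 / (variances + eps)
    # Normalize each frame so weights sum to n_features; with equal
    # variances this reduces to the unweighted case.
    return w * (w.shape[1] / w.sum(axis=1, keepdims=True))

def weighted_reconstruction_error(y_true, y_pred, variances):
    """Relevance-weighted squared error, averaged over frames and features."""
    w = relevance_weights(variances)
    return float(np.mean(w * (y_true - y_pred) ** 2))
```

With equal variances across features the weights are all 1 and the criterion reduces to plain mean squared error; as one feature's variance drops relative to the others, its reconstruction errors are penalized more heavily.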
Bibliographic reference. Canevari, Claudia / Badino, Leonardo / Fadiga, Luciano / Metta, Giorgio (2013): "Relevance-weighted-reconstruction of articulatory features in deep-neural-network-based acoustic-to-articulatory mapping", In INTERSPEECH-2013, 1297-1301.