We propose a practical, feature-level fusion approach for combining acoustic and articulatory information in the speaker verification task. We find that concatenating articulatory features obtained from measured speech production data with conventional Mel-frequency cepstral coefficients (MFCCs) improves overall speaker verification performance. However, since access to measured articulatory data is impractical for real-world speaker verification applications, we also experiment with estimated articulatory features obtained using an acoustic-to-articulatory inversion technique. Specifically, we show that augmenting MFCCs with articulatory features obtained from a subject-independent acoustic-to-articulatory inversion technique also significantly enhances speaker verification performance. This performance boost could be due to the information about inter-speaker variation present in the estimated articulatory features, especially at the mean and variance level. Experimental results on the Wisconsin X-Ray Microbeam database show that the proposed acoustic-estimated-articulatory fusion approach significantly outperforms the traditional acoustic-only baseline, providing up to a 10% relative reduction in Equal Error Rate (EER). We further show that an additional 5% relative reduction in EER can be achieved after score-level fusion.
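The three operations named in the abstract (feature-level fusion by concatenation, score-level fusion, and relative EER reduction) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the feature dimensions (13 MFCCs, 6 articulatory values per frame), the weighted-sum rule for score-level fusion, and the threshold-sweep EER estimate are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def fuse_features(mfcc, artic):
    """Feature-level fusion: concatenate two frame-aligned streams.

    mfcc:  (T, D_a) acoustic features, one row per frame
    artic: (T, D_b) measured or estimated articulatory features
    Returns a (T, D_a + D_b) array.
    """
    mfcc = np.asarray(mfcc, dtype=float)
    artic = np.asarray(artic, dtype=float)
    if mfcc.shape[0] != artic.shape[0]:
        raise ValueError("both streams must have the same number of frames")
    return np.hstack([mfcc, artic])


def fuse_scores(scores_a, scores_b, weight=0.5):
    """Score-level fusion as a weighted sum (an assumed, common rule)."""
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    return weight * scores_a + (1.0 - weight) * scores_b


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores."""
    tgt = np.asarray(target_scores, dtype=float)
    non = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([tgt, non]))
    # False rejection: target trials scoring below the threshold.
    frr = np.array([(tgt < t).mean() for t in thresholds])
    # False acceptance: non-target trials scoring at or above the threshold.
    far = np.array([(non >= t).mean() for t in thresholds])
    # The EER is taken where the two error rates are closest.
    i = int(np.argmin(np.abs(far - frr)))
    return (far[i] + frr[i]) / 2.0


def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction, e.g. 0.10 means a 10% relative reduction."""
    return (baseline_eer - new_eer) / baseline_eer


if __name__ == "__main__":
    # Hypothetical dimensions: 5 frames, 13 MFCCs, 6 articulatory values.
    fused = fuse_features(np.zeros((5, 13)), np.ones((5, 6)))
    print(fused.shape)
    # Illustrative scores only; these are not results from the paper.
    print(equal_error_rate([1.0, 3.0], [0.0, 2.0]))
```

Under this definition, a system whose EER drops from a hypothetical 10.0% to 9.0% achieves a 10% relative reduction, which is the sense in which the abstract reports its improvements.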
Bibliographic reference. Li, Ming / Kim, Jangwon / Ghosh, Prasanta Kumar / Ramanarayanan, Vikram / Narayanan, Shrikanth (2013): "Speaker verification based on fusion of acoustic and articulatory information", In INTERSPEECH-2013, 1614-1618.