This paper presents an approach to articulatory inversion using audio and video of the user's face, requiring no special markers. The video is stabilized with respect to the face, and the mouth region is cropped out. The mouth image is projected into a learned independent component subspace to obtain a low-dimensional representation of the mouth appearance. The inversion problem is treated as one of regression: a non-linear regressor based on relevance vector machines is trained on a dataset of simultaneous images of a subject's face, acoustic features, and positions of magnetic coils glued to the subject's tongue. The results show the benefit of using both cues for inversion. We envisage the inversion method as part of a pronunciation training system with articulatory feedback.
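The pipeline sketched in the abstract (ICA projection of mouth crops, then non-linear regression from combined audio and video features to coil positions) can be illustrated with a minimal, hypothetical sketch; this is not the authors' code. scikit-learn's FastICA stands in for the learned independent component subspace, and, since scikit-learn ships no relevance vector machine, KernelRidge serves as a stand-in non-linear regressor. All array names and feature dimensions below are assumptions for illustration.

```python
# Minimal sketch of the described inversion pipeline (illustrative only).
# Assumptions: X_video holds flattened grayscale mouth crops, X_audio holds
# per-frame acoustic features (e.g. MFCCs), and Y holds EMA coil coordinates.
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
n_frames = 500
X_video = rng.normal(size=(n_frames, 32 * 32))  # flattened mouth crops (size is hypothetical)
X_audio = rng.normal(size=(n_frames, 13))       # e.g. 13 MFCCs per frame
Y = rng.normal(size=(n_frames, 6))              # 2D positions of 3 tongue coils

# Project the mouth images into a learned independent component subspace
# to obtain a low-dimensional appearance representation.
ica = FastICA(n_components=10, random_state=0)
V = ica.fit_transform(X_video)

# Concatenate the acoustic and visual cues and regress to coil positions.
# KernelRidge is a stand-in for the relevance vector machine used in the paper.
X = np.hstack([X_audio, V])
regressor = KernelRidge(kernel="rbf", alpha=1.0)
regressor.fit(X, Y)
Y_pred = regressor.predict(X)
```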
Cite as: Kjellström, H., Engwall, O., Bälter, O. (2006) Reconstructing tongue movements from audio and video. Proc. Interspeech 2006, paper 1071-Thu1A3O.4, doi: 10.21437/Interspeech.2006-321
@inproceedings{kjellstrom06_interspeech,
  author={Hedvig Kjellström and Olov Engwall and Olle Bälter},
  title={{Reconstructing tongue movements from audio and video}},
  year=2006,
  booktitle={Proc. Interspeech 2006},
  pages={paper 1071-Thu1A3O.4},
  doi={10.21437/Interspeech.2006-321}
}