ISCA Archive Interspeech 2006

Reconstructing tongue movements from audio and video

Hedvig Kjellström, Olov Engwall, Olle Bälter

This paper presents an approach to articulatory inversion using audio and video of the user's face, requiring no special markers. The video is stabilized with respect to the face, and the mouth region is cropped out. The mouth image is projected into a learned independent component subspace to obtain a low-dimensional representation of the mouth appearance. The inversion problem is treated as one of regression; a non-linear regressor using relevance vector machines is trained on a dataset of simultaneous images of a subject's face, acoustic features, and positions of magnetic coils glued to the subject's tongue. The results show the benefit of using both cues for inversion. We envisage the inversion method as part of a pronunciation training system with articulatory feedback.
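The pipeline in the abstract can be sketched in a few lines. This is not the authors' implementation: FastICA stands in for the learned independent component subspace, kernel ridge regression stands in for the relevance vector machine (scikit-learn ships no RVM), and all data, shapes, and dimensionalities below are illustrative stand-ins.

```python
# Hedged sketch of the abstract's pipeline, under the substitutions noted above.
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)

# Synthetic stand-ins: 200 frames of cropped 16x16 mouth images, 13 acoustic
# features per frame (e.g. cepstral coefficients), and 2-D positions of
# 3 tongue coils (6 regression targets). All sizes are assumptions.
mouth = rng.standard_normal((200, 16 * 16))
audio = rng.standard_normal((200, 13))
coils = rng.standard_normal((200, 6))

# 1. Low-dimensional mouth appearance: project images into an ICA subspace.
ica = FastICA(n_components=10, random_state=0)
mouth_low = ica.fit_transform(mouth)

# 2. Fuse the visual and acoustic cues into one feature vector per frame.
features = np.hstack([mouth_low, audio])

# 3. Non-linear regression from fused features to coil positions
#    (kernel ridge here; the paper uses relevance vector machines).
reg = KernelRidge(kernel="rbf", alpha=1.0)
reg.fit(features, coils)
pred = reg.predict(features)
print(pred.shape)  # one 6-D coil-position estimate per frame
```

Dropping either the `mouth_low` or `audio` columns from `features` gives the single-cue baselines against which the paper's audiovisual benefit would be measured.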

doi: 10.21437/Interspeech.2006-321

Cite as: Kjellström, H., Engwall, O., Bälter, O. (2006) Reconstructing tongue movements from audio and video. Proc. Interspeech 2006, paper 1071-Thu1A3O.4, doi: 10.21437/Interspeech.2006-321

@inproceedings{kjellstrom06_interspeech,
  author={Hedvig Kjellström and Olov Engwall and Olle Bälter},
  title={{Reconstructing tongue movements from audio and video}},
  year=2006,
  booktitle={Proc. Interspeech 2006},
  pages={paper 1071-Thu1A3O.4},
  doi={10.21437/Interspeech.2006-321}
}