Auditory-Visual Speech Processing (AVSP) 2010

Hakone, Kanagawa, Japan
September 30-October 3, 2010

Acoustic-to-Articulatory Inversion in Speech Based on Statistical Models

Atef Ben Youssef, Pierre Badin, Gérard Bailly

GIPSA-lab (Département Parole & Cognition / ICP), Grenoble University, France

Two speech inversion methods are implemented and compared. In the first, multistream Hidden Markov Models (HMMs) of phonemes are jointly trained on synchronous streams of articulatory data acquired by ElectroMagnetic Articulography (EMA) and of speech spectral parameters. An acoustic recognition system uses the acoustic part of the HMMs to deliver a phoneme chain and the state durations; this information is then used by a trajectory formation procedure, based on the articulatory part of the HMMs, to resynthesise the articulatory movements. In the second method, Gaussian Mixture Models (GMMs) are trained on the same streams to associate articulatory frames directly with acoustic frames in context, using Maximum Likelihood Estimation (MLE). Over a corpus of 17 minutes uttered by a French speaker, the RMS reconstruction error was 1.62 mm with the HMMs and 2.25 mm with the GMMs.
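The GMM-based mapping can be illustrated with a simplified sketch: a joint GMM is fit on stacked acoustic and articulatory frames, and articulatory frames are then predicted from acoustic frames as the responsibility-weighted conditional mean of each component. This is a minimal minimum-mean-square-error variant, not the authors' MLE mapping with dynamic features; all function names and parameters below are illustrative assumptions.

```python
# Simplified GMM-based acoustic-to-articulatory mapping (MMSE variant).
# Not the paper's exact MLE-with-dynamic-features procedure.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(X, Y, n_components=4, seed=0):
    """Fit one GMM on joint [acoustic; articulatory] frame vectors."""
    Z = np.hstack([X, Y])
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="full", random_state=seed)
    gmm.fit(Z)
    return gmm

def map_acoustic_to_articulatory(gmm, X, dx):
    """Predict articulatory frames as E[y | x] under the joint GMM.

    For each component k: E_k[y|x] = mu_y + S_yx S_xx^{-1} (x - mu_x),
    weighted by the responsibility p(k | x) of the acoustic marginal.
    """
    mu_x = gmm.means_[:, :dx]
    mu_y = gmm.means_[:, dx:]
    S_xx = gmm.covariances_[:, :dx, :dx]
    S_yx = gmm.covariances_[:, dx:, :dx]
    K = len(gmm.weights_)
    Y_hat = np.zeros((len(X), mu_y.shape[1]))
    for i, x in enumerate(X):
        # Responsibilities from the acoustic marginal of each component.
        resp = np.array([gmm.weights_[k] *
                         multivariate_normal.pdf(x, mu_x[k], S_xx[k])
                         for k in range(K)])
        resp /= resp.sum()
        for k in range(K):
            cond_mean = mu_y[k] + S_yx[k] @ np.linalg.solve(S_xx[k], x - mu_x[k])
            Y_hat[i] += resp[k] * cond_mean
    return Y_hat
```

In practice the acoustic side would hold spectral parameters (e.g. cepstral frames) and the articulatory side EMA coil coordinates; the sketch only shows the joint-density estimation and conditional-mean regression steps.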

Index Terms: Speech inversion, ElectroMagnetic Articulography (EMA), Hidden Markov Model (HMM), Gaussian Mixture Model (GMM), Maximum Likelihood Estimation (MLE).


Bibliographic reference. Ben Youssef, Atef / Badin, Pierre / Bailly, Gérard (2010): "Acoustic-to-articulatory inversion in speech based on statistical models", In AVSP-2010, paper S8-3.