12th Annual Conference of the International Speech Communication Association

Florence, Italy
August 27-31, 2011

Phoneme-Level Text to Audio Synchronization on Speech Signals with Background Music

Agnès Pedone, Juan José Burred, Simon Maller, Pierre Leveau

Audionamix, France

We address the task of synchronizing a given phoneme transcription with the corresponding speech signal, when the latter is linearly mixed with background music. To that end, we propose a new method based on Non-negative Matrix Factorization in the time-frequency domain, which models the speech as a source-filter factorization that includes a synchronization parameter matrix. Phoneme models, which consist of collections of basic spectral envelopes, are learned from a training set of isolated speech. The model is subjected to an iterative Maximum Likelihood optimization that concurrently estimates pitch, synchronization parameters and the contribution of the music part. Results show the feasibility of the system for application in text-informed audio processing and automatic subtitle synchronization.
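
As a rough illustration of the kind of factorization the abstract describes, the following is a minimal sketch of a generic Non-negative Matrix Factorization with multiplicative updates for the Kullback-Leibler divergence. It is not the authors' model: it omits the source-filter structure, the pitch estimation and the synchronization parameter matrix, and it only assumes that a magnitude spectrogram V can be approximated by a fixed, pre-learned dictionary of speech spectral envelopes plus a small free dictionary for the music. All names (kl_nmf_fixed_speech, W_speech, n_music and so on) are hypothetical and chosen for this example.

```python
import numpy as np


def kl_divergence(V, V_hat, eps=1e-9):
    """Generalized Kullback-Leibler divergence between V and its model V_hat."""
    return float(np.sum(V * np.log((V + eps) / (V_hat + eps)) - V + V_hat))


def kl_nmf_fixed_speech(V, W_speech, n_music=4, n_iter=200, eps=1e-9, seed=0):
    """Approximate V ~= W_speech @ H_speech + W_music @ H_music.

    V        : non-negative spectrogram, shape (n_freq, n_frames)
    W_speech : fixed non-negative dictionary, shape (n_freq, n_speech)
               (stands in for pre-learned phoneme envelopes; an assumption)
    Only H_speech, W_music and H_music are updated, using the standard
    multiplicative rules for the KL divergence.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W_music = rng.random((n_freq, n_music)) + eps
    H_speech = rng.random((W_speech.shape[1], n_frames)) + eps
    H_music = rng.random((n_music, n_frames)) + eps
    ones = np.ones_like(V)

    def ratio():
        # Element-wise ratio between the data and the current model.
        V_hat = W_speech @ H_speech + W_music @ H_music + eps
        return V / V_hat

    for _ in range(n_iter):
        # Speech activations (the speech dictionary itself stays fixed).
        H_speech *= (W_speech.T @ ratio()) / (W_speech.T @ ones + eps)
        # Music activations.
        H_music *= (W_music.T @ ratio()) / (W_music.T @ ones + eps)
        # Music dictionary.
        W_music *= (ratio() @ H_music.T) / (ones @ H_music.T + eps)

    return H_speech, W_music, H_music
```

Under these assumptions, calling kl_nmf_fixed_speech(V, W_speech) returns the speech activations together with the music dictionary and its activations. In a synchronization setting, the speech activations would be further constrained by the known phoneme order, which this sketch does not attempt.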


