We address the task of synchronizing a given phoneme transcription with the corresponding speech signal when the latter is linearly mixed with background music. To that end, we propose a new method based on Non-negative Matrix Factorization in the time-frequency domain, which models the speech as a source-filter factorization that includes a synchronization parameter matrix. Phoneme models, which consist of collections of basic spectral envelopes, are learned from a training set of isolated speech. The model is subjected to an iterative Maximum Likelihood optimization that concurrently estimates the pitch, the synchronization parameters and the contribution of the music component. Results demonstrate the feasibility of the system for text-informed audio processing and automatic subtitle synchronization.
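As an illustration of the factorization framework the abstract refers to, the following is a minimal, generic sketch of NMF with multiplicative updates minimizing the generalized Kullback-Leibler divergence, a common choice for time-frequency magnitude spectrograms that corresponds to Maximum Likelihood estimation under a Poisson observation model. This is a textbook NMF (Lee-Seung style) for context only; it does not implement the paper's source-filter model, synchronization parameter matrix, or pitch estimation, and all names in it are illustrative.

```python
import numpy as np

def nmf_kl(V, rank, n_iter=200, seed=0):
    """Generic NMF: approximate a non-negative spectrogram V (freq x frames)
    as W @ H, with W holding spectral basis vectors (e.g. spectral envelopes)
    and H their activations over time.

    Multiplicative updates minimizing the generalized KL divergence
    D(V || W @ H); a sketch, not the paper's full model.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + 1e-3    # non-negative spectral basis
    H = rng.random((rank, n_frames)) + 1e-3  # non-negative activations
    eps = 1e-12                              # guards against division by zero
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0, keepdims=True).T + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1, keepdims=True).T + eps)
    return W, H
```

In a text-informed setting along the lines of the abstract, the basis W would be constrained by pre-trained phoneme envelope collections rather than learned freely, and additional factors would account for the music contribution.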
Bibliographic reference. Pedone, Agnès / Burred, Juan José / Maller, Simon / Leveau, Pierre (2011): "Phoneme-level text to audio synchronization on speech signals with background music", in Proc. INTERSPEECH 2011, pp. 433-436.