ISCA Tutorial and Research Workshop on Statistical and Perceptual Audition (SAPA2008)

Brisbane, Australia
September 21, 2008

Combining Pitch-Based Inference and Non-Negative Spectrogram Factorization in Separating Vocals from Polyphonic Music

Tuomas Virtanen, Annamaria Mesaros, Matti Ryynänen

Department of Signal Processing, Tampere University of Technology, Finland

This paper proposes a novel algorithm for separating vocals from polyphonic music accompaniment. Based on pitch estimation, the method first creates a binary mask indicating the time-frequency segments of the magnitude spectrogram where harmonic content of the vocal signal is present. Second, non-negative matrix factorization (NMF) is applied to the non-vocal segments of the spectrogram in order to learn a model for the accompaniment. NMF predicts the amount of noise in the vocal segments, which allows vocals and noise to be separated even when they overlap in time and frequency. Simulations with commercial and synthesized acoustic material show average improvements of 1.3 dB and 1.8 dB, respectively, over a reference algorithm based on sinusoidal modeling, and the perceptual quality of the separated vocals is also clearly improved. The method was also tested in aligning separated vocals with textual lyrics, where it produced better results than the reference method.
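
The two stages described above can be illustrated with a small sketch. It is not the authors' implementation. It assumes the binary vocal mask has already been obtained from a pitch estimator. It uses weighted Euclidean multiplicative updates, since the abstract does not state the cost function. The function names (`masked_nmf`, `separate_vocals`) and parameters (`n_components`, `n_iter`) are hypothetical.

```python
import numpy as np

def masked_nmf(X, vocal_mask, n_components=4, n_iter=200, seed=0, eps=1e-9):
    """Fit X ~ B @ G using only the bins where vocal_mask is False.

    X          : non-negative magnitude spectrogram, shape (n_freq, n_frames)
    vocal_mask : boolean array, True where vocal harmonics are assumed present
    Assumption : weighted Euclidean cost with multiplicative updates.
    """
    rng = np.random.default_rng(seed)
    W = (~vocal_mask).astype(float)          # weight 1 on non-vocal bins, 0 on vocal bins
    n_freq, n_frames = X.shape
    B = rng.random((n_freq, n_components)) + eps    # spectral basis vectors
    G = rng.random((n_components, n_frames)) + eps  # time-varying gains
    for _ in range(n_iter):
        R = B @ G
        B *= ((W * X) @ G.T) / ((W * R) @ G.T + eps)
        R = B @ G
        G *= (B.T @ (W * X)) / (B.T @ (W * R) + eps)
    return B, G

def separate_vocals(X, vocal_mask, **nmf_kwargs):
    """Predict the accompaniment inside the vocal bins and subtract it."""
    B, G = masked_nmf(X, vocal_mask, **nmf_kwargs)
    accomp = B @ G                           # accompaniment model over all bins
    residual = np.maximum(X - accomp, 0.0)   # keep magnitudes non-negative
    vocals = np.where(vocal_mask, residual, 0.0)
    return vocals, accomp
```

Because the masked bins carry zero weight, the accompaniment model is learned only from non-vocal regions, and the product `B @ G` then extrapolates into the vocal regions. The sketch requires every frame to contain at least some non-vocal bins, otherwise the gains for that frame cannot be estimated.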

Bibliographic reference.  Virtanen, Tuomas / Mesaros, Annamaria / Ryynänen, Matti (2008): "Combining pitch-based inference and non-negative spectrogram factorization in separating vocals from polyphonic music", In SAPA-2008, 17-22.