9th Annual Conference of the International Speech Communication Association

Brisbane, Australia
September 22-26, 2008

Rhythm Based Music Segmentation and Octave Scale Cepstral Features for Sung Language Recognition

Namunu C. Maddage, Haizhou Li

Institute for Infocomm Research, Singapore

Sung language recognition relies on both effective feature extraction and acoustic modeling. In this paper, we study rhythm based music segmentation (BSS), in which the frame size varies in proportion to the inter-beat interval of the music, in contrast to the fixed length segmentation (FIX) used in spoken language recognition. We show that acoustic features extracted with the BSS scheme outperform those extracted with FIX. We also compare the effectiveness of musically motivated acoustic features, octave scale cepstral coefficients (OSCCs), with that of log frequency cepstral coefficients. We adopt Gaussian mixture models for sung language classifier design. Experiments conducted on a database of 400 popular songs sung in four languages (English, Chinese, German and Indonesian) show that the OSCC feature outperforms the other features. We achieve a sung language identification accuracy of 64.9% with Gaussian mixture models trained on shifted-delta-cepstral OSCC acoustic features extracted via BSS.
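
The contrast between rhythm based and fixed length segmentation can be illustrated with a minimal sketch. It assumes that beat positions have already been detected and are given in milliseconds. The function names and the `frames_per_beat` and `frame_ms` parameters are hypothetical choices for illustration, not values or procedures taken from the paper.

```python
def beat_proportional_frames(beat_times_ms, frames_per_beat=4):
    """Split each inter-beat interval into equal frames.

    The frame length is (inter-beat interval / frames_per_beat), so it
    grows and shrinks with the local tempo. `frames_per_beat` is an
    illustrative assumption, not a setting from the paper.
    Returns a list of (start_ms, end_ms) tuples.
    """
    frames = []
    # Pair each beat with the next one to get the inter-beat intervals.
    for start, end in zip(beat_times_ms, beat_times_ms[1:]):
        step = (end - start) / frames_per_beat
        for i in range(frames_per_beat):
            frames.append((start + i * step, start + (i + 1) * step))
    return frames


def fixed_length_frames(duration_ms, frame_ms=20):
    """Split a signal into non-overlapping frames of constant length.

    `frame_ms` is an illustrative assumption. Any remainder shorter than
    one frame at the end of the signal is dropped.
    Returns a list of (start_ms, end_ms) tuples.
    """
    n_frames = duration_ms // frame_ms
    return [(i * frame_ms, (i + 1) * frame_ms) for i in range(n_frames)]


# Example: the second inter-beat interval is twice as long as the first,
# so its frames are twice as long under the beat-proportional scheme.
beats = [0, 500, 1500]
bss_like = beat_proportional_frames(beats)
fix_like = fixed_length_frames(1500)
```

In this sketch the beat-proportional frames last 125 ms in the first interval and 250 ms in the second, while the fixed scheme uses 20 ms frames throughout regardless of tempo.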

Bibliographic reference.  Maddage, Namunu C. / Li, Haizhou (2008): "Rhythm based music segmentation and octave scale cepstral features for sung language recognition", In INTERSPEECH-2008, 2526-2529.