Sixth European Conference on Speech Communication and Technology

Budapest, Hungary
September 5-9, 1999

Feature Fusion for Music Detection

Eluned S. Parris, Michael J. Carey, Harvey Lloyd-Thomas

Ensigma Ltd., Turing House, Station Road, Chepstow, Monmouthshire, UK

Automatic discrimination between music, speech and noise has become an increasingly important research topic in recent years. Classifying audio into categories such as music or speech is an important part of the multimedia document retrieval problem. This paper extends the authors' previous work, which compared the performance of static and transitional features based on cepstra, amplitude, zero-crossings and pitch for music and speech discrimination. Two approaches to combining the features to improve overall performance are described. The first approach uses a separate Gaussian mixture model (GMM) classifier for each feature type and fuses the outputs of the classifiers. The second approach combines the different features into a single vector before modelling the data with a GMM. Both approaches give significant improvements in performance over the results achieved with a single type of feature. The best system achieves an equal error rate of 0.3% on ten-second tests using seventeen hours of test material. Performance is maintained as the length of the test file is reduced, with an equal error rate of less than 1% achieved using only two seconds of data.
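The two fusion approaches described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the data are synthetic, the two "feature types" A and B are hypothetical stand-ins for features such as cepstra or pitch, the GMMs are small diagonal-covariance models fitted with a basic EM loop, and classifier outputs are fused by an unweighted sum of average log-likelihood ratios. The abstract does not state how the authors fused the classifier outputs, so the sum rule is only one plausible choice.

```python
import numpy as np

def _component_log_probs(X, weights, means, variances):
    # Log of (weight * diagonal Gaussian density) for each frame and component.
    diff = X[:, None, :] - means[None, :, :]
    log_gauss = -0.5 * (np.log(2 * np.pi * variances)[None] + diff**2 / variances[None]).sum(axis=2)
    return np.log(weights)[None, :] + log_gauss

def _logsumexp(a):
    # Numerically stable log-sum-exp over the component axis.
    m = a.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=1, keepdims=True)))[:, 0]

def fit_gmm(X, k=2, iters=25, seed=0):
    """Fit a diagonal-covariance GMM with plain EM; returns (weights, means, variances)."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    means = X[rng.choice(n, size=k, replace=False)]
    variances = np.tile(X.var(axis=0) + 1e-3, (k, 1))
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each frame.
        log_p = _component_log_probs(X, weights, means, variances)
        resp = np.exp(log_p - _logsumexp(log_p)[:, None])
        # M-step: re-estimate parameters, with a small variance floor.
        nk = resp.sum(axis=0) + 1e-10
        weights = nk / nk.sum()
        means = (resp.T @ X) / nk[:, None]
        variances = np.maximum((resp.T @ X**2) / nk[:, None] - means**2, 0.0) + 1e-3
    return weights, means, variances

def segment_llr(X, music_gmm, speech_gmm):
    """Average per-frame log-likelihood ratio (music vs. speech) over one segment."""
    ll_music = _logsumexp(_component_log_probs(X, *music_gmm))
    ll_speech = _logsumexp(_component_log_probs(X, *speech_gmm))
    return float(np.mean(ll_music - ll_speech))

def equal_error_rate(target_scores, nontarget_scores):
    """Error rate at the threshold where miss and false-alarm rates are closest."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    miss = np.array([(target_scores < t).mean() for t in thresholds])
    false_alarm = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(miss - false_alarm))
    return 0.5 * (miss[i] + false_alarm[i])

# Synthetic frames: hypothetical feature type A (3-dim) and B (2-dim).
rng = np.random.default_rng(1)

def make_frames(sign, n):
    a = rng.normal(sign * 1.0, 1.0, size=(n, 3))
    b = rng.normal(sign * 0.5, 1.0, size=(n, 2))
    return a, b

music_a, music_b = make_frames(+1, 2000)    # "music" training frames
speech_a, speech_b = make_frames(-1, 2000)  # "speech" training frames

# Approach 1: one music/speech GMM pair per feature type, outputs fused later.
gmm_a = (fit_gmm(music_a), fit_gmm(speech_a))
gmm_b = (fit_gmm(music_b), fit_gmm(speech_b))

# Approach 2: concatenate the features into one vector, then one GMM pair.
gmm_ab = (fit_gmm(np.hstack([music_a, music_b])), fit_gmm(np.hstack([speech_a, speech_b])))

def score_segment(a, b):
    late = segment_llr(a, *gmm_a) + segment_llr(b, *gmm_b)  # assumed sum-rule fusion
    early = segment_llr(np.hstack([a, b]), *gmm_ab)
    return late, early

# Score 40 test segments of 50 frames per class; columns are (late, early).
music_scores = np.array([score_segment(*make_frames(+1, 50)) for _ in range(40)])
speech_scores = np.array([score_segment(*make_frames(-1, 50)) for _ in range(40)])

eer_late = equal_error_rate(music_scores[:, 0], speech_scores[:, 0])
eer_early = equal_error_rate(music_scores[:, 1], speech_scores[:, 1])
print(f"EER, classifier-output fusion: {eer_late:.3f}")
print(f"EER, single fused vector:      {eer_early:.3f}")
```

The synthetic classes here are far easier to separate than real audio, so the printed error rates say nothing about the results reported in the paper. The sketch only shows where the combination happens in each approach: after scoring in the first, before modelling in the second.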


Bibliographic reference.  Parris, Eluned S. / Carey, Michael J. / Lloyd-Thomas, Harvey (1999): "Feature fusion for music detection", In EUROSPEECH'99, 2191-2194.