7th International Conference on Spoken Language Processing
September 16-20, 2002
This paper addresses the discrimination of speech, music, and music with songs in telephony under handset variability. Two systems are proposed. The first system uses three Gaussian Mixture Models (GMM), one each for speech, music and songs. Each GMM comprises 8 Gaussians trained on very short sessions. Twenty-six speakers (13 females, 13 males) were randomly chosen from the SPIDRE corpus. The music was drawn from a large data set and comprises various styles. For 138 minutes of testing time, a speech discrimination score of 97.9% is obtained when no channel normalization is used. This performance is obtained with a relatively short analysis frame (32 ms sliding window, buffering of 100 ms). When channel normalization is used, a substantial score reduction (on the order of 10 to 20%) is observed. The second system is designed for applications requiring shorter processing times along with shorter training sessions. It is based on an empirical transformation of the MFCC that enhances the dynamical evolution of tonality. On average, it yields an acceptable discrimination rate of 90% (speech/music) and 84% (speech, music and songs with music).
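The decision stage of a three-model GMM classifier like the first system can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state the decision rule, so the sketch assumes diagonal covariances and picks the class with the highest average frame log-likelihood over a buffer. The function names and the toy two-dimensional, two-component models are hypothetical; the paper's models use 8 Gaussians per class trained on real features.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, gmm):
    """Log-likelihood of one frame under a GMM given as (weight, mean, var) tuples."""
    terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    top = max(terms)  # log-sum-exp trick for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))


def classify_buffer(frames, models):
    """Return (label, scores) where label maximizes the average frame log-likelihood.

    Assumed decision rule: the abstract does not specify how frames are combined.
    """
    scores = {
        label: sum(gmm_log_likelihood(f, gmm) for f in frames) / len(frames)
        for label, gmm in models.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy models: hand-set parameters, not trained on any corpus.
TOY_MODELS = {
    "speech": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "music": [(0.5, [5.0, 5.0], [1.0, 1.0]), (0.5, [6.0, 4.0], [1.0, 1.0])],
    "songs": [(0.5, [-5.0, 5.0], [1.0, 1.0]), (0.5, [-4.0, 6.0], [1.0, 1.0])],
}

label, _ = classify_buffer([[0.2, 0.1], [0.9, 1.1]], TOY_MODELS)
print(label)
```

In a real setting, each frame would be a feature vector (for example MFCC) computed over the analysis window, and the buffer would hold the frames accumulated over the buffering interval.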
Bibliographic reference. Ezzaidi, Hassan / Rouat, Jean (2002): "Speech, music and songs discrimination in the context of handsets variability", In ICSLP-2002, 2013-2016.