Interspeech'2005 - Eurospeech
This paper discusses how to discriminate between singing and speaking voices using the local and global characteristics of voice signals. Subjective experiments show that human listeners can discriminate singing and speaking voices with more than 70% accuracy from 300 ms signals and more than 95% accuracy from one-second signals. Based on these results, and assuming that different features are effective for short-term and long-term signals, we designed two measures: one using the spectral envelope (MFCC) and one using the fundamental frequency (F0, perceived as pitch) contour. Experimental results show that the F0 measure performs better than the spectral envelope measure when the input voice signals are longer than one second; in particular, it discriminates singing and speaking voices with more than 80% accuracy for two-second signals. Conversely, when the input signals are shorter than one second, the spectral envelope measure performs better than the F0 measure. Finally, simply combining the two measures yields more than 90% accuracy for two-second signals.
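The abstract does not say how the two measures are combined, so the following Python fragment is only an illustrative sketch of one simple possibility: a weighted sum of two per-signal scores. The names `MeasureScores`, `combine_scores`, `classify`, and `f0_weight` are hypothetical. So is the convention that each measure yields a real-valued score where a positive value favors singing (for example, a log-likelihood ratio). None of this is taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class MeasureScores:
    """Hypothetical per-signal scores; a positive value favors 'singing'."""
    spectral: float  # assumed score from the spectral envelope (MFCC) measure
    f0: float        # assumed score from the F0 contour measure


def combine_scores(scores: MeasureScores, f0_weight: float = 0.5) -> float:
    """Weighted sum of the two scores (an assumed combination rule)."""
    if not 0.0 <= f0_weight <= 1.0:
        raise ValueError("f0_weight must lie in [0, 1]")
    return (1.0 - f0_weight) * scores.spectral + f0_weight * scores.f0


def classify(scores: MeasureScores, f0_weight: float = 0.5) -> str:
    """Label the signal by the sign of the combined score."""
    return "singing" if combine_scores(scores, f0_weight) > 0.0 else "speaking"
```

Under these assumptions, `classify(MeasureScores(spectral=-0.2, f0=1.0))` weights both measures equally. Because the abstract reports that the F0 measure is stronger on longer signals and the spectral envelope measure on shorter ones, a weight that grows with signal duration would be one natural variant to try.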
Bibliographic reference. Ohishi, Yasunori / Goto, Masataka / Itou, Katunobu / Takeda, Kazuya (2005): "Discrimination between singing and speaking voices", In INTERSPEECH-2005, 1141-1144.