Interspeech'2005 - Eurospeech

Lisbon, Portugal
September 4-8, 2005

Discrimination Between Singing and Speaking Voices

Yasunori Ohishi (1), Masataka Goto (2), Katunobu Itou (1), Kazuya Takeda (1)

(1) Nagoya University, Japan; (2) AIST, Japan

This paper discusses discriminating between singing and speaking voices using the local and global characteristics of voice signals. Subjective experiments show that human listeners can discriminate singing and speaking voices with more than 70% accuracy from 300 ms signals and more than 95% accuracy from one-second signals. Based on these results, and assuming that different features are effective for short-term and long-term signals, we designed two measures: one using the spectral envelope (MFCC) and one using the fundamental frequency (F0, perceived as pitch) contour. Experimental results show that the F0 measure performs better than the spectral envelope measure when the input voice signals are longer than one second; in particular, it discriminates singing and speaking voices with more than 80% accuracy for two-second signals. Conversely, when the input signals are shorter than one second, the spectral envelope measure performs better than the F0 measure. Finally, simply combining the two measures yields more than 90% accuracy for two-second signals.
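
The abstract does not say how the two measures are combined. As a minimal sketch, assume that each measure yields a log-likelihood ratio for a segment (positive values favoring singing) and that the combination is a weighted sum. The function names, the `weight` parameter, and the example scores below are hypothetical and are not taken from the paper.

```python
def combine_scores(spectral_llr: float, f0_llr: float, weight: float = 0.5) -> float:
    """Weighted sum of two log-likelihood ratios (assumed combination rule).

    Positive values favor singing, negative values favor speaking.
    `weight` is the share given to the spectral envelope measure.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * spectral_llr + (1.0 - weight) * f0_llr


def classify(spectral_llr: float, f0_llr: float, weight: float = 0.5) -> str:
    """Label a segment as singing or speaking from the combined score."""
    combined = combine_scores(spectral_llr, f0_llr, weight)
    return "singing" if combined > 0.0 else "speaking"


if __name__ == "__main__":
    # Hypothetical scores: the spectral measure leans slightly toward
    # speaking, the F0 measure leans more strongly toward singing.
    print(classify(spectral_llr=-0.4, f0_llr=1.2))
```

Under these assumptions, a larger `weight` would suit short segments, where the abstract reports that the spectral envelope measure performs better, and a smaller one would suit segments longer than one second, where the F0 measure performs better.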

Bibliographic reference. Ohishi, Yasunori / Goto, Masataka / Itou, Katunobu / Takeda, Kazuya (2005): "Discrimination between singing and speaking voices", in INTERSPEECH-2005, 1141-1144.