ODYSSEY 2004 - The Speaker and Language Recognition Workshop
May 31 - June 3, 2004
Speaker recognition systems employ a speech detection algorithm and use only the frames detected as speech for further processing. The accuracy of a speaker recognition system therefore depends on the method used to detect speech, particularly in real-life deployments where the incoming speech varies significantly in loudness and noise characteristics. Actual deployments also mandate real-time processing, in which look-ahead should be minimized, and eliminated if possible, and in which the speech detector cannot rely on statistics of speech features, such as energy levels, computed across the entire utterance. This makes many prevalent speech detection methods that use energy statistics unsuitable for speaker recognition deployments. In addition, speech detection is more challenging in text-independent speaker recognition systems than in text-dependent systems, since there is no inherent validation and/or detection of the spoken text. In this paper we describe a robust speech detection method based on voicing score estimation that allows for real-time speech detection, and compare it to other real-time methods under different conditions. All tested algorithms satisfy two requirements: they exhibit consistent performance across data sets with different noise characteristics, and they operate in real time. The voicing-based algorithm is shown to perform significantly better than the other tested speech detection algorithms.
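The abstract does not say how the voicing score is computed, so the following is only an illustrative sketch of what a causal, voicing-based detector might look like, not the authors' algorithm. It assumes the score is the peak of the normalized autocorrelation over plausible pitch lags, which is insensitive to signal gain and needs no utterance-level statistics. The names `voicing_score` and `StreamingSpeechDetector`, the 8 kHz sample rate, the pitch range, the 0.5 threshold, and the hangover length are all hypothetical choices made for illustration.

```python
import math


def voicing_score(frame, sample_rate=8000, f0_min=60.0, f0_max=400.0):
    """Return a voicing score in [0, 1] for one frame of samples.

    Assumption: the score is the peak normalized autocorrelation over
    lags corresponding to pitch between f0_min and f0_max. The measure
    is scale-invariant, so it does not depend on loudness.
    """
    n = len(frame)
    if n == 0:
        return 0.0
    mean = sum(frame) / n
    x = [s - mean for s in frame]
    if sum(s * s for s in x) <= 1e-12:
        return 0.0  # digital silence carries no periodicity
    lag_min = max(1, int(sample_rate / f0_max))
    lag_max = min(int(sample_rate / f0_min), n - 1)
    best = 0.0
    for lag in range(lag_min, lag_max + 1):
        m = n - lag
        num = sum(x[i] * x[i + lag] for i in range(m))
        e_head = sum(x[i] * x[i] for i in range(m))
        e_tail = sum(x[i + lag] * x[i + lag] for i in range(m))
        denom = math.sqrt(e_head * e_tail)
        if denom > 0.0:
            best = max(best, num / denom)
    return best


class StreamingSpeechDetector:
    """Frame-by-frame detector with no look-ahead.

    Each decision uses only the current frame plus a short hangover
    counter, so unvoiced sounds right after voiced speech are kept.
    """

    def __init__(self, threshold=0.5, hangover_frames=5):
        self.threshold = threshold              # hypothetical value
        self.hangover_frames = hangover_frames  # hypothetical value
        self._hang = 0

    def process(self, frame):
        """Return True if the frame should be treated as speech."""
        if voicing_score(frame) >= self.threshold:
            self._hang = self.hangover_frames
            return True
        if self._hang > 0:
            self._hang -= 1
            return True
        return False
```

In this sketch a caller would feed consecutive frames (for example 30 ms, or 240 samples at 8 kHz) to `process` as they arrive and pass only the frames flagged `True` to the speaker recognition front end. Because the decision never uses future frames or whole-utterance energy statistics, it fits the real-time constraint described above.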
Bibliographic reference. Zilca, Ran D. / Pelecanos, Jason W. / Chaudhari, Upendra V. / Ramaswamy, Ganesh N. (2004): "Real time robust speech detection for text independent speaker recognition", In ODYS-2004, 123-128.