EUROSPEECH 2003 - INTERSPEECH 2003
8th European Conference on Speech Communication and Technology

Geneva, Switzerland
September 1-4, 2003

        

Fusing High- and Low-Level Features for Speaker Recognition

Joseph P. Campbell, Douglas A. Reynolds, Robert B. Dunn

Massachusetts Institute of Technology, USA

Automatic speaker recognition has been dominated by systems that use only short-term, low-level acoustic information, such as cepstral features. While these systems have produced low error rates, they ignore higher levels of information that also convey speaker information. Recently published work has demonstrated that such high-level information can be used successfully in automatic speaker recognition systems, improving accuracy and potentially increasing robustness. Wide-ranging approaches based on high-level features, using pronunciation models, prosodic dynamics, pitch gestures, phone streams, and conversational interactions, were explored and developed under the SuperSID project at the 2002 JHU CLSP Summer Workshop (WS2002): http://www.clsp.jhu.edu/ws2002/groups/supersid/. In this paper, we show how these novel features and classifiers provide complementary information and can be fused together to drive down the equal error rate on the 2001 NIST Extended Data Task to 0.2%, a 71% relative reduction in error over the previous state of the art.
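
To make the quantities in the abstract concrete, the following is a minimal sketch of score-level fusion and of how an equal error rate (EER) and a relative error reduction can be computed. It is an illustration only: the abstract does not say which fusion method was used, so the equal-weight linear combination here is an assumption, and the function names (`equal_error_rate`, `fuse`, `relative_reduction`) and all scores are hypothetical toy values, not data from the paper.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER by sweeping a threshold over all observed scores."""
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # Miss: a target trial scores below the threshold.
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        # False alarm: an impostor trial scores at or above the threshold.
        p_fa = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(p_miss - p_fa)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (p_miss + p_fa) / 2
    return eer


def fuse(score_lists, weights):
    """Weighted sum of per-system scores for each trial (assumed fusion rule)."""
    return [sum(w * s for w, s in zip(weights, trial))
            for trial in zip(*score_lists)]


def relative_reduction(old_error, new_error):
    """Fraction by which new_error improves on old_error."""
    return (old_error - new_error) / old_error


# Toy scores for two hypothetical systems on the same four target
# and four impostor trials (invented for illustration).
low_level_tgt, low_level_imp = [2.0, 1.5, 0.2, 1.8], [0.5, -1.0, -0.5, 0.1]
high_level_tgt, high_level_imp = [1.0, 0.1, 1.9, 1.2], [-0.8, 0.6, -0.3, 0.0]

fused_tgt = fuse([low_level_tgt, high_level_tgt], [0.5, 0.5])
fused_imp = fuse([low_level_imp, high_level_imp], [0.5, 0.5])

eer_low = equal_error_rate(low_level_tgt, low_level_imp)
eer_high = equal_error_rate(high_level_tgt, high_level_imp)
eer_fused = equal_error_rate(fused_tgt, fused_imp)
```

In this toy setup each system alone makes one error of each kind, but the errors fall on different trials, so the fused scores separate targets from impostors. That is the sense in which complementary systems can lower the EER. Working backward with the same relative-reduction formula, a 71% reduction ending at 0.2% implies a prior EER of roughly 0.69%; that figure is inferred from the two stated numbers and is not given in the abstract.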


Bibliographic reference.  Campbell, Joseph P. / Reynolds, Douglas A. / Dunn, Robert B. (2003): "Fusing high- and low-level features for speaker recognition", In EUROSPEECH-2003, 2665-2668.