Odyssey 2012 - The Speaker and Language Recognition Workshop
The degradation in performance of a typical speaker verification system in noisy environments can be attributed to the mis-match in the features derived from clean training and noisy test conditions. The mis-match is severe in low-energy regions of the signal where noise dominates the speech signal. A robust feature extraction scheme should focus on the high-energy peaks in the time-frequency region. In this paper, we develop a signal analysis technique which attempts to model these high-energy peaks using two-dimensional (2-D) autoregressive (AR) models. The first AR model of the sub-band Hilbert envelopes is derived using frequency domain linear prediction (FDLP). Then, these all-pole envelopes from each sub-band are converted to short-term energy estimates and the energy values across various sub-bands are used as a sampled power spectral estimate for the second AR model. The output prediction coefficients from the second AR model are converted to cepstral coefficients and are used for speaker recognition. Experiments are performed using noisy versions of NIST 2010 speaker recognition evaluation (SRE) data with the state-of-art speaker recognition system. In these experiments, the proposed features provide significant improvements compared to baseline MFCC features (relative improvements of 30%). We also experiment on a large dataset of IARPA NIST 2011 speaker recognition challenge, where the 2-D AR model provides noticeable improvements (relative improvements of 15 - 20%).
Bibliographic reference. Ganapathy, Sriram / Thomas, Samuel / Hermansky, Hynek (2012): "Feature extraction using 2-d autoregressive models for speaker recognition", In Odyssey-2012, 229-235.