Speaker recognition in noisy environments is challenging when there is a mismatch between the data used for enrollment and the data used for verification. In this paper, we propose a robust feature extraction scheme based on spectro-temporal modulation filtering using two-dimensional (2-D) autoregressive (AR) models. The first step is AR modeling of the sub-band temporal envelopes, obtained by applying linear prediction to the sub-band discrete cosine transform (DCT) components. These sub-band envelopes are stacked together and used for a second AR modeling step, in which the spectral envelope across the sub-bands is approximated by an AR model. Cepstral features derived from this model are used for speaker recognition. AR modeling emphasizes the high-energy regions of the signal, which are relatively well preserved in the presence of noise. The degree of modulation filtering is controlled by the AR model order. Experiments are performed on noisy versions of the NIST 2010 speaker recognition evaluation (SRE) data with a state-of-the-art speaker recognition system. In these experiments, the proposed features provide significant improvements over baseline features (relative improvements of 20% in equal error rate (EER) and 35% in miss rate at 10% false alarm).
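The two AR modeling steps described in the abstract can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the authors' implementation: the uniform rectangular sub-bands, the model orders, the frame averaging, and all function names (`dct_ii`, `levinson`, `ar_envelope`, `ar_to_cepstrum`, `two_d_ar_features`) are hypothetical choices made here for illustration.

```python
import numpy as np

def dct_ii(x):
    """Orthonormal type-II DCT, computed directly (O(N^2), fine for short signals)."""
    n = len(x)
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    scale = np.full(n, np.sqrt(2.0 / n))
    scale[0] = np.sqrt(1.0 / n)
    return scale * (np.cos(np.pi * (t + 0.5) * k / n) @ x)

def levinson(r, order):
    """Levinson-Durbin recursion: autocorrelation r -> (a, err), with a[0] == 1."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = max(r[0], 1e-12)  # floor guards against an all-zero input
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i] = a[1:i] + k * a[1:i][::-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def ar_envelope(a, err, n_points):
    """Evaluate the AR model err / |A(e^{jw})|^2 on n_points samples."""
    return err / np.abs(np.fft.rfft(a, 2 * n_points)[:n_points]) ** 2

def ar_to_cepstrum(a, err):
    """Standard AR-to-cepstrum recursion; c[0] carries the log prediction error."""
    p = len(a) - 1
    c = np.zeros(p + 1)
    c[0] = np.log(err)
    for n in range(1, p + 1):
        c[n] = -a[n] - sum(k * c[k] * a[n - k] for k in range(1, n)) / n
    return c

def two_d_ar_features(signal, n_bands=8, t_order=10, s_order=4, n_frames=20):
    # Step 1: linear prediction on sub-band DCT components gives an AR model
    # of each sub-band temporal envelope (assumed: uniform rectangular bands).
    coeffs = dct_ii(np.asarray(signal, dtype=float))
    edges = np.linspace(0, len(coeffs), n_bands + 1).astype(int)
    n_points = n_frames * 8  # assumed envelope resolution
    envs = []
    for b in range(n_bands):
        s = coeffs[edges[b]:edges[b + 1]]
        r = np.array([np.dot(s[:len(s) - k], s[k:]) for k in range(t_order + 1)])
        envs.append(ar_envelope(*levinson(r, t_order), n_points))
    # Stack the envelopes and average within frames (assumed integration).
    frames = np.array(envs).reshape(n_bands, n_frames, -1).mean(axis=2)
    # Step 2: per frame, fit an AR model across sub-bands and derive cepstra.
    feats = []
    for spec in frames.T:
        r = np.fft.irfft(spec)[:s_order + 1]  # power spectrum -> autocorrelation
        feats.append(ar_to_cepstrum(*levinson(r, s_order)))
    return np.array(feats)  # shape: (n_frames, s_order + 1)
```

In this sketch, lowering `t_order` or `s_order` smooths the temporal or spectral envelope more heavily, which mirrors the abstract's statement that the AR model order controls the degree of modulation filtering.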
Bibliographic reference. Mallidi, Sri Harish / Ganapathy, Sriram / Hermansky, Hynek (2013): "Robust speaker recognition using spectro-temporal autoregressive models", In INTERSPEECH-2013, 3689-3693.