14th Annual Conference of the International Speech Communication Association

Lyon, France
August 25-29, 2013

Frequency Warping and Robust Speaker Verification: A Comparison of Alternative Mel-Scale Representations

Tomi Kinnunen (1), Md. Jahangir Alam (2), Pavel Matějka (3), Patrick Kenny (2), Jan Černocký (3), Douglas O'Shaughnessy (4)

(1) University of Eastern Finland, Finland
(2) CRIM, Canada
(3) Brno University of Technology, Czech Republic
(4) INRS-EMT, Canada

Speaker verification accuracy is high under controlled conditions but falls off rapidly in the presence of interfering sounds. This is because spectral features, such as Mel-frequency cepstral coefficients (MFCCs), are sensitive to additive noise. MFCCs are one particular realization of a warped-frequency representation with a low-frequency focus, but there are several alternative, potentially more robust, warped-frequency representations. We provide an experimental comparison of five warped-frequency features. They use exactly the same frequency warping function, the same number of coefficients and the same postprocessing, but differ in their internal computations. The compared variants are (1) conventional MFCCs computed from the discrete Fourier transform (DFT) followed by a Mel-scaled filterbank, (2) MFCCs via direct warping of the DFT followed by a linear-scale filterbank, (3) warped linear prediction features, (4) perceptual minimum variance distortionless features and (5) recently proposed sparse Mel-scale histogram features. Experiments carried out on a subset of the SRE 10 corpus using a scaled-down i-vector system indicate that direct DFT warping outperforms conventional MFCCs in most cases.
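To make the difference between variants (1) and (2) concrete, the following is a minimal sketch and not the authors' implementation. It assumes a commonly used Mel formula, triangular filters, linear interpolation for the warping step, and an 8 kHz sampling rate in the usage lines; the abstract specifies none of these. All function names are hypothetical. In a typical MFCC pipeline, the filterbank energies from either variant would then be log-compressed and passed through a discrete cosine transform.

```python
import math

def hz_to_mel(f_hz):
    # A common Mel formula (assumption; the paper's exact warping is not given here).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def triangle_energies(spectrum, edges):
    # Apply triangular filters to a spectrum. `edges` holds n_filters + 2
    # strictly increasing positions in (fractional) bin units.
    energies = []
    for lo, mid, hi in zip(edges, edges[1:], edges[2:]):
        total = 0.0
        for k, value in enumerate(spectrum):
            if lo < k <= mid:
                total += value * (k - lo) / (mid - lo)
            elif mid < k < hi:
                total += value * (hi - k) / (hi - mid)
        energies.append(total)
    return energies

def mel_filterbank_energies(spectrum, sample_rate, n_filters):
    # Variant (1): keep the linear-frequency spectrum and place the
    # filter edges uniformly on the Mel axis.
    n_bins = len(spectrum)
    nyquist = sample_rate / 2.0
    mel_max = hz_to_mel(nyquist)
    edges = [mel_to_hz(mel_max * i / (n_filters + 1)) / nyquist * (n_bins - 1)
             for i in range(n_filters + 2)]
    return triangle_energies(spectrum, edges)

def warp_spectrum(spectrum, sample_rate):
    # Resample the spectrum at points uniformly spaced on the Mel axis,
    # using linear interpolation between neighbouring bins.
    n_bins = len(spectrum)
    nyquist = sample_rate / 2.0
    mel_max = hz_to_mel(nyquist)
    warped = []
    for i in range(n_bins):
        pos = mel_to_hz(mel_max * i / (n_bins - 1)) / nyquist * (n_bins - 1)
        lo = min(int(pos), n_bins - 2)
        frac = pos - lo
        warped.append((1.0 - frac) * spectrum[lo] + frac * spectrum[lo + 1])
    return warped

def warped_dft_energies(spectrum, sample_rate, n_filters):
    # Variant (2): warp the spectrum first, then use filters that are
    # uniformly spaced on the warped axis.
    warped = warp_spectrum(spectrum, sample_rate)
    n_bins = len(warped)
    edges = [(n_bins - 1) * i / (n_filters + 1) for i in range(n_filters + 2)]
    return triangle_energies(warped, edges)

# Usage on a toy magnitude spectrum (257 bins, assumed 8 kHz sampling rate).
toy_spectrum = [1.0] * 257
conventional = mel_filterbank_energies(toy_spectrum, 8000, 20)
direct_warp = warped_dft_energies(toy_spectrum, 8000, 20)
```

Both functions return one energy per filter. They differ only in whether the Mel warping is applied to the filter positions or to the spectrum itself, which is the distinction the abstract draws between the first two variants.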


Bibliographic reference. Kinnunen, Tomi / Alam, Md. Jahangir / Matějka, Pavel / Kenny, Patrick / Černocký, Jan / O'Shaughnessy, Douglas (2013): "Frequency warping and robust speaker verification: a comparison of alternative mel-scale representations", in INTERSPEECH-2013, 3122-3126.