Odyssey 2010: The Speaker and Language Recognition Workshop

Brno, Czech Republic
28 June – 1 July 2010

An i-vector Extractor Suitable for Speaker Recognition with both Microphone and Telephone Speech

Mohammed Senoussaoui (1), Patrick Kenny (2), Najim Dehak (3), Pierre Dumouchel (1)

(1) École de Technologie Supérieure (ETS) and Centre de Recherche Informatique de Montréal (CRIM), Canada; (2) Centre de Recherche Informatique de Montréal (CRIM), Canada; (3) Spoken Language Systems, CSAIL, MIT, Cambridge, USA

It is widely believed that speaker verification systems perform better when there is sufficient background training data to deal with nuisance effects of transmission channels. It is also known that these systems perform at their best when the sound environment of the training data is similar to that of the context of use (test context). For some applications, however, training data from the matching type of sound environment is scarce, whereas a considerable amount of data from a different type of environment is available. In this paper, we propose a new architecture for text-independent speaker verification systems that can be trained satisfactorily with a limited amount of application-specific data, supplemented with a sufficient amount of training data from some other context. This architecture is based on the extraction of parameters (i-vectors) from a low-dimensional space (the total variability space) proposed by Dehak [1]. Our aim is to extend Dehak's work to speaker recognition on sparse data, namely microphone speech. The main challenge is that too little application-specific data is available to estimate the total variability covariance matrix accurately. We propose a method based on Joint Factor Analysis (JFA) to estimate microphone eigenchannels (sparse data) with telephone eigenchannels (sufficient data). For classification, we experimented with two approaches: Support Vector Machines (SVM) and a Cosine Distance Scoring (CDS) classifier. We present recognition results for the female portion of the interview data of the NIST 2008 SRE. The best performance is obtained when our system is fused with the state-of-the-art JFA system. We achieve a 13% relative improvement in equal error rate, and the minimum value of the detection cost function decreases from 0.0219 to 0.0164.
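The abstract names Cosine Distance Scoring but does not spell out the computation. As a rough illustration only, the sketch below scores a trial by the standard cosine similarity between a target-speaker i-vector and a test i-vector, then compares the score with a decision threshold. The function names, the toy vectors, and the threshold value are hypothetical, and any channel compensation the paper applies to the i-vectors before scoring is left out.

```python
import math


def cosine_score(w_target, w_test):
    """Cosine similarity between two i-vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(w_target, w_test))
    norm_target = math.sqrt(sum(a * a for a in w_target))
    norm_test = math.sqrt(sum(b * b for b in w_test))
    return dot / (norm_target * norm_test)


def accept(w_target, w_test, threshold):
    """Accept the trial when the cosine score reaches the threshold."""
    return cosine_score(w_target, w_test) >= threshold


if __name__ == "__main__":
    # Toy 3-dimensional vectors; real i-vectors have far more dimensions.
    target = [0.2, -1.0, 0.5]
    test = [0.1, -0.8, 0.7]
    # Hypothetical threshold, chosen only for illustration.
    print(accept(target, test, threshold=0.5))
```

Because the score depends only on the angle between the two vectors, scaling either i-vector leaves the decision unchanged.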


Bibliographic reference.  Senoussaoui, Mohammed / Kenny, Patrick / Dehak, Najim / Dumouchel, Pierre (2010): "An i-vector Extractor Suitable for Speaker Recognition with both Microphone and Telephone Speech", In Odyssey-2010, paper 006.