Odyssey 2010: The Speaker and Language Recognition Workshop

Brno, Czech Republic
28 June - 1 July 2010

Evaluation of the Vulnerability of Speaker Verification to Synthetic Speech

Phillip DeLeon (1), Michael Pucher (2), Junichi Yamagishi (3)

(1) New Mexico State University, (2) Telecommunications Research Center (FTW), (3) University of Edinburgh

In this paper, we evaluate the vulnerability of a speaker verification (SV) system to synthetic speech. Although this problem was first examined over a decade ago, dramatic improvements in both SV and speech synthesis have renewed interest in it. We use an HMM-based speech synthesizer, which creates synthetic speech for a targeted speaker through adaptation of a background model, and a GMM-UBM-based SV system. Using 283 speakers from the Wall Street Journal (WSJ) corpus, our SV system has a 0.4% equal error rate (EER). When the system is tested with synthetic speech generated from speaker models derived from the WSJ corpus, 90% of the matched claims are accepted. This result suggests a possible vulnerability in SV systems to synthetic speech. To detect synthetic speech prior to recognition, we investigate the use of an automatic speech recognizer (ASR), the dynamic-time-warping (DTW) distance of mel-frequency cepstral coefficients (MFCC), and the previously proposed average inter-frame difference of log-likelihood (IFDLL). Overall, while SV systems have impressive accuracy, even with the proposed detector, high-quality synthetic speech can lead to an unacceptably high acceptance rate of synthetic speakers.
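
The two figures quoted in the abstract, the EER and the acceptance rate of synthetic speech, can be illustrated with a minimal sketch. It assumes the SV system outputs one score per trial and accepts a claim when the score is at or above a threshold. The function names and the toy scores below are hypothetical and are not taken from the paper; the threshold search is a simple scan over observed scores, not necessarily the procedure the authors used.

```python
def equal_error_rate(genuine, impostor):
    """Scan candidate thresholds and return (eer, threshold) at the point
    where the false acceptance and false rejection rates are closest.

    A claim is accepted when its score is >= threshold (an assumption).
    """
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # false acceptance rate
        frr = sum(s < t for s in genuine) / len(genuine)     # false rejection rate
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


def acceptance_rate(scores, threshold):
    """Fraction of trials whose score meets the acceptance threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


# Toy scores, invented for illustration only (not data from the paper).
genuine = [2.0, 1.5, 1.8, 0.4, 2.2]       # true-speaker trials
impostor = [-1.0, 0.1, 0.6, -0.5, -2.0]   # human impostor trials
synthetic = [1.9, 1.2, 0.7, 0.3, 2.5]     # synthetic speech, matched claims

eer, threshold = equal_error_rate(genuine, impostor)
print(f"EER = {eer:.2f} at threshold {threshold}")
print(f"synthetic acceptance rate = {acceptance_rate(synthetic, threshold):.2f}")
```

The sketch fixes the operating threshold using human speech only and then scores synthetic speech against that threshold, which is why a system with a low EER can still accept a large share of synthetic claims.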


Bibliographic reference. DeLeon, Phillip / Pucher, Michael / Yamagishi, Junichi (2010): "Evaluation of the Vulnerability of Speaker Verification to Synthetic Speech", in Odyssey-2010, paper 028.