INTERSPEECH 2012
13th Annual Conference of the International Speech Communication Association

Portland, OR, USA
September 9-13, 2012

Synthetic Speech Discrimination using Pitch Pattern Statistics Derived from Image Analysis

Phillip L. De Leon (1), Bryan Stewart (1), Junichi Yamagishi (2)

(1) New Mexico State University, Klipsch School of Elect. and Comp. Eng., Las Cruces, NM, USA
(2) University of Edinburgh, Centre for Speech Technology Research (CSTR), Edinburgh, UK

In this paper, we extend the work of Ogihara et al. to discriminate between human and synthetic speech using features based on pitch patterns. As previously demonstrated, significant differences in pitch patterns between human and synthetic speech can be leveraged to classify speech as human or synthetic in origin. We propose mean pitch stability, mean pitch stability range, and jitter as features, extracted after image analysis of pitch patterns. We have observed that for synthetic speech these features lie in a small, distinct region of the feature space compared to human speech, and we model them with a multivariate Gaussian distribution. Our classifier is trained on synthetic speech collected from the 2008 and 2011 Blizzard Challenges, along with Festival pre-built voices, and on human speech from the NIST 2002 corpus. We evaluate the classifier on a much larger corpus than previously studied, using human speech from the Switchboard corpus, synthetic speech from the Resource Management corpus, and synthetic speech generated by Festival trained on the Wall Street Journal corpus. Results show 98% accuracy in correctly classifying human speech and 96% accuracy in correctly classifying synthetic speech.
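The classification scheme the abstract describes — fitting a multivariate Gaussian to a small set of pitch-pattern features and thresholding the likelihood — can be sketched as follows. The feature values, the regularization term, and the threshold below are illustrative assumptions, not figures from the paper:

```python
import numpy as np

# Hypothetical training vectors of [mean pitch stability,
# mean pitch stability range, jitter] for synthetic speech.
# Values are invented for illustration only.
synthetic_train = np.array([
    [0.90, 0.05, 0.010],
    [0.92, 0.04, 0.012],
    [0.91, 0.06, 0.011],
    [0.89, 0.05, 0.009],
])

# Fit a multivariate Gaussian to the synthetic-speech features.
# A small diagonal term keeps the covariance invertible.
mu = synthetic_train.mean(axis=0)
cov = np.cov(synthetic_train, rowvar=False) + 1e-6 * np.eye(3)

def log_likelihood(x, mu, cov):
    """Log of the multivariate Gaussian density N(x; mu, cov)."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(cov)
    mahalanobis = diff @ np.linalg.inv(cov) @ diff
    return -0.5 * (d * np.log(2 * np.pi) + logdet + mahalanobis)

def classify(x, threshold=0.0):
    """Label a feature vector 'synthetic' if it is likely under the
    synthetic-speech model, 'human' otherwise (threshold is assumed)."""
    return "synthetic" if log_likelihood(x, mu, cov) > threshold else "human"
```

Because the synthetic-speech features occupy a tight region, vectors near the fitted mean score a high likelihood, while typical human-speech features fall far outside and score very low, making a simple threshold effective.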

Index Terms: Speaker recognition, Speech synthesis, Security


Bibliographic reference.  De Leon, Phillip L. / Stewart, Bryan / Yamagishi, Junichi (2012): "Synthetic speech discrimination using pitch pattern statistics derived from image analysis", In INTERSPEECH-2012, 370-373.