Current speaker verification systems are vulnerable to advanced speech manipulation techniques such as voice conversion and speaker adaptation for text-to-speech (TTS) systems. Effective anti-spoofing systems that discriminate between human and synthetic impostors have been developed. However, many of them still have two main drawbacks: speaker dependency and, more importantly, dependency on the counterfeiting technique. A universal synthetic speech detector (SSD) therefore remains an open issue. This paper explores the feasibility of such a system using a statistical classifier for human and synthetic speech. Given the great diversity of counterfeiting techniques, we have chosen to model a variety of state-of-the-art minimum-phase vocoders, creating synthetic impostor signals by copy-synthesis. Two speech parameter sets are used: MFCCs as a canonical baseline and a parameterization based on relative phase shift (RPS). Because most speech synthesis and conversion techniques disregard phase information, human and synthetic signals presumably have different phase structures, and phase-related parameters allow synthetic speech to be detected on that basis. The experiments show that speaker-independent classifiers perform very well for every vocoder. Cross-vocoder experiments show that the system is highly dependent on the type of vocoder, and that the RPS parameterization performs better than MFCC for multi-vocoder models.
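The abstract does not define the RPS parameterization. As a rough illustration only, the sketch below assumes the commonly used definition in which the RPS of harmonic k is its instantaneous phase minus k times the phase of the fundamental, wrapped to (-pi, pi]. The function name `relative_phase_shift` and its input layout are hypothetical, and the full parameterization used in the paper may involve further processing not shown here.

```python
import numpy as np

def relative_phase_shift(harmonic_phases):
    """Sketch of RPS under an assumed definition (not taken from the abstract).

    harmonic_phases[k-1] is the instantaneous phase (radians) of harmonic k
    at one analysis instant, so index 0 holds the fundamental.
    """
    phases = np.asarray(harmonic_phases, dtype=float)
    k = np.arange(1, phases.size + 1)      # harmonic numbers 1..K
    rps = phases - k * phases[0]           # remove the linear phase term
    return np.angle(np.exp(1j * rps))      # wrap to (-pi, pi]

# Hypothetical phases for three harmonics at one analysis instant.
example = relative_phase_shift([0.3, 1.0, -2.0])
```

Under this definition, moving the analysis instant adds a term proportional to k to each harmonic phase, and that term cancels in the subtraction, so the values describe the phase structure of the waveform rather than the position of the analysis window.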
Bibliographic reference. Sanchez, Jon / Saratxaga, Ibon / Hernaez, Inma / Navas, Eva / Erro, Daniel (2014): "A cross-vocoder study of speaker independent synthetic speech detection using phase information", In INTERSPEECH-2014, 1663-1667.