Speech synthesis and voice conversion techniques pose a threat to current speaker verification (SV) systems. It is therefore essential to develop front-end systems that can distinguish human speech from spoofed speech (synthesized or voice-converted). In this paper, for the ASVspoof 2015 challenge, we propose a detector based on the combination of cochlear filter cepstral coefficients (CFCC) and change in instantaneous frequency (IF), i.e., CFCCIF features, to detect natural vs. spoofed speech. The CFCCIF features were extracted at the frame level, and a Gaussian mixture model (GMM)-based classification system was used. On the development set, the proposed CFCCIF features, after score-level fusion with Mel frequency cepstral coefficient (MFCC) features, achieved an EER of 1.52%, a significant reduction from MFCC (3.26%) and CFCCIF (2.29%) alone using 12-D static features. The EER further decreases to 0.89% and 0.83% with delta and delta-delta features, respectively. Experimental results on the evaluation set show that fusion of MFCC and CFCCIF works relatively well, with an EER of 0.41% for known attacks and 2.013% for unknown attacks. On average, fusion of MFCC and CFCCIF features gave the best overall EER of 1.211% in the challenge.
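The GMM back-end described above can be sketched as follows: one GMM is trained on natural-speech features and one on spoofed-speech features, a test utterance is scored by the average per-frame log-likelihood ratio, and scores from two feature streams (e.g., MFCC and CFCCIF) are combined by weighted score-level fusion. This is a minimal illustration using synthetic 12-D vectors in place of actual feature extraction; the component count and fusion weight `alpha` are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-ins for 12-D frame-level features (e.g., MFCC or CFCCIF);
# a real system would extract these from speech signals.
natural_train = rng.normal(0.0, 1.0, size=(500, 12))
spoofed_train = rng.normal(2.0, 1.0, size=(500, 12))

# One GMM per class, as in a standard two-class GMM detector.
gmm_natural = GaussianMixture(n_components=4, random_state=0).fit(natural_train)
gmm_spoofed = GaussianMixture(n_components=4, random_state=0).fit(spoofed_train)

def llr_score(frames):
    """Average per-frame log-likelihood ratio (natural minus spoofed).

    Positive scores favour the natural-speech hypothesis.
    """
    return gmm_natural.score(frames) - gmm_spoofed.score(frames)

def fuse(score_mfcc, score_cfccif, alpha=0.5):
    """Weighted-sum score-level fusion of two feature streams.

    alpha is an illustrative fusion weight; in practice it would be
    tuned on a development set.
    """
    return alpha * score_mfcc + (1.0 - alpha) * score_cfccif

# Score held-out frames drawn from each class.
test_natural = rng.normal(0.0, 1.0, size=(100, 12))
test_spoofed = rng.normal(2.0, 1.0, size=(100, 12))
```

With well-separated training data, `llr_score(test_natural)` is positive and `llr_score(test_spoofed)` is negative, so thresholding the fused score at zero separates the two classes in this toy setting.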
Bibliographic reference. Patel, Tanvina B. / Patil, Hemant A. (2015): "Combining evidences from mel cepstral, cochlear filter cepstral and instantaneous frequency features for detection of natural vs. spoofed speech", in Proceedings of INTERSPEECH 2015, 2062-2066.