ISCA Archive Odyssey 2022

Speaker-Targeted Synthetic Speech Detection

Diego Castan, Md Hafizur Rahman, Sarah Bakst, Chris Cobo-Kroenke, Mitchell McLaren, Martin Graciarena, Aaron Lawson

Text-to-speech (TTS) and voice conversion (VC) technologies are evolving quickly towards realistic-sounding, human-like voices. As this technology improves, so does the opportunity for malpractice in speaker identification (SID) via spoofing, the process of impersonating a voice biometric through synthesis. More data typically yields a more realistic voice model, which poses a risk for well-known subjects, such as politicians and celebrities, who have vast amounts of multimedia available online. Synthetic-speech detection has relied on signal-processing techniques that generate new acoustic features and train deep-learning models to detect manipulated audio by characterizing unnatural changes or artifacts. However, these techniques use no information about the speaker being evaluated. This paper proposes incorporating information about the speaker-of-interest (SoI) into the models to prevent targeted spoofing attacks against vulnerable individuals, serving as a logical access (LA) control tool. The wealth of data available for well-known people can also be used to train a speaker-specific spoofing detector that is more accurate than a speaker-independent model. The paper proposes a new xResNet-PLDA system and compares it with three baseline systems: a state-of-the-art speaker identification system, an xResNet system trained to discriminate between bonafide and fake speech, and a speaker identification system whose PLDA and calibration models were trained with bonafide and fake speech. We evaluated the systems in two scenarios, cross-validation and holdout, with three different databases. We show that the proposed system dramatically outperforms the baselines in every scenario and on every database. Finally, we show that using a small amount of the SoI's speech to adapt global calibration parameters further improves performance, especially in unseen conditions.

doi: 10.21437/Odyssey.2022-9

Cite as: Castan, D., Rahman, M.H., Bakst, S., Cobo-Kroenke, C., McLaren, M., Graciarena, M., Lawson, A. (2022) Speaker-Targeted Synthetic Speech Detection. Proc. The Speaker and Language Recognition Workshop (Odyssey 2022), 62-69, doi: 10.21437/Odyssey.2022-9

@inproceedings{castan22_odyssey,
  author={Diego Castan and Md Hafizur Rahman and Sarah Bakst and Chris Cobo-Kroenke and Mitchell McLaren and Martin Graciarena and Aaron Lawson},
  title={{Speaker-Targeted Synthetic Speech Detection}},
  year=2022,
  booktitle={Proc. The Speaker and Language Recognition Workshop (Odyssey 2022)},
  pages={62--69},
  doi={10.21437/Odyssey.2022-9}
}