V2S attack: building DNN-based voice conversion from automatic speaker verification

Taiki Nakamura, Yuki Saito, Shinnosuke Takamichi, Yusuke Ijima, Hiroshi Saruwatari

This paper presents a new voice impersonation attack using voice conversion (VC). Enrolling personal voices for automatic speaker verification (ASV) offers natural and flexible biometric authentication. In principle, ASV systems do not store the users’ voice data. However, if an ASV system is unexpectedly exposed and hacked by a malicious attacker, there is a risk that the attacker will use VC techniques to reproduce the enrolled users’ voices. We name this threat the “verification-to-synthesis (V2S) attack” and propose a VC training method that uses the ASV model and a pre-trained automatic speech recognition (ASR) model but no voice data of the targeted speaker. The VC model reproduces the targeted speaker’s individuality by deceiving the ASV model, and preserves the phonetic properties of an input voice by matching the phonetic posteriorgrams predicted by the ASR model. The experimental evaluation compares voices converted by the proposed method, which does not use the targeted speaker’s voice data, with those of standard VC methods that do. The results demonstrate that the proposed method performs comparably to existing VC methods trained on a very small amount of parallel voice data.
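The training objective described above can be sketched as a combination of two terms: a loss that pushes the ASV model to accept the converted voice as the targeted speaker, and a loss that matches the phonetic posteriorgrams (PPGs) of the input and converted voices. The sketch below is a minimal illustration under assumed function names and loss forms; it is not the authors' actual implementation.

```python
import numpy as np

# Hedged sketch of the V2S objective: all names, signatures, and the exact
# loss forms (cross-entropy for ASV deception, MSE for PPG matching) are
# illustrative assumptions, not taken from the paper.

def asv_deception_loss(asv_score_converted: float) -> float:
    """Cross-entropy pushing the ASV acceptance probability for the
    converted voice toward 1, i.e. 'this is the targeted speaker'."""
    eps = 1e-12  # numerical floor to avoid log(0)
    return float(-np.log(asv_score_converted + eps))

def ppg_matching_loss(ppg_input: np.ndarray, ppg_converted: np.ndarray) -> float:
    """Mean squared error between the PPGs (frame-by-phoneme posterior
    matrices) of the input and converted voices, encouraging the
    conversion to preserve phonetic content."""
    return float(np.mean((ppg_input - ppg_converted) ** 2))

def v2s_loss(asv_score_converted: float,
             ppg_input: np.ndarray,
             ppg_converted: np.ndarray,
             weight: float = 1.0) -> float:
    """Combined objective: deceive the ASV model + preserve phonetics.
    `weight` (assumed) balances the two terms."""
    return (asv_deception_loss(asv_score_converted)
            + weight * ppg_matching_loss(ppg_input, ppg_converted))
```

In an actual attack, both terms would be differentiated through the (frozen) ASV and ASR models back into the VC model's parameters; only the VC model is updated.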

DOI: 10.21437/SSW.2019-29

Cite as: Nakamura, T., Saito, Y., Takamichi, S., Ijima, Y., Saruwatari, H. (2019) V2S attack: building DNN-based voice conversion from automatic speaker verification. Proc. 10th ISCA Speech Synthesis Workshop, 161-165, DOI: 10.21437/SSW.2019-29.

@inproceedings{nakamura2019v2s,
  author={Taiki Nakamura and Yuki Saito and Shinnosuke Takamichi and Yusuke Ijima and Hiroshi Saruwatari},
  title={{V2S attack: building DNN-based voice conversion from automatic speaker verification}},
  booktitle={Proc. 10th ISCA Speech Synthesis Workshop},
  year={2019},
  pages={161--165},
  doi={10.21437/SSW.2019-29}
}