Voice Conversion without Explicit Separation of Source and Filter Components Based on Non-negative Matrix Factorization

Hitoshi Suda, Daisuke Saito, Nobuaki Minematsu


This paper introduces a new voice conversion (VC) technique that performs spectrogram-to-spectrogram conversion. Conventional studies on VC focus on spectral envelopes, which represent vocal tract information. While vocoders have enabled lightweight and high-quality synthesis from these features, flexibility and quality are still limited by parameterization. To overcome this limitation, this paper aims to model and convert spectrograms themselves. In general, spectrograms are difficult to model because they contain not only spectral envelopes but also source structures. This paper adopts source-filter non-negative matrix factorization (SF-NMF) as a conversion model of spectrograms. SF-NMF is an extension of non-negative matrix factorization (NMF) that models source and filter components jointly without explicit separation. The proposed method generates waveforms by reconstructing phase information from amplitude spectrograms. Since SF-NMF requires log-frequency spectrograms, the method uses scalograms, which are obtained by the continuous wavelet transform (CWT). Experimental results showed that the proposed method achieved spectrogram-to-spectrogram speaker conversion.
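For readers unfamiliar with the base model, the following is a minimal sketch of plain NMF, which SF-NMF extends. It is not the SF-NMF model of this paper: it factorizes a non-negative matrix V (standing in for an amplitude spectrogram) into a basis matrix W and an activation matrix H using the standard multiplicative updates for the Euclidean cost. The function names, the rank, the iteration count, and the random toy data are all assumptions made for illustration.

```python
import numpy as np


def nmf(V, rank, n_iter=200, eps=1e-10, seed=0):
    """Plain NMF (not SF-NMF): approximate V by W @ H with W, H >= 0.

    V is a non-negative (n_bins, n_frames) matrix, e.g. an amplitude
    spectrogram. Uses multiplicative updates for the Euclidean cost.
    """
    rng = np.random.default_rng(seed)
    n_bins, n_frames = V.shape
    # Random non-negative initialization (an assumption of this sketch).
    W = rng.random((n_bins, rank)) + eps
    H = rng.random((rank, n_frames)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so W and H
        # stay non-negative; eps guards against division by zero.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def frobenius_error(V, W, H):
    """Frobenius norm of the reconstruction residual V - W @ H."""
    return float(np.linalg.norm(V - W @ H))


# Toy data: a random non-negative matrix of rank at most 4, used here
# in place of a real spectrogram or scalogram.
rng = np.random.default_rng(1)
V_toy = rng.random((32, 4)) @ rng.random((4, 50))
W_toy, H_toy = nmf(V_toy, rank=4)
```

In this toy setting the columns of W_toy play the role of spectral bases and the rows of H_toy their time-varying activations. Joint modeling of source and filter components, as described in the abstract, requires the additional structure of SF-NMF, which this sketch does not attempt.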


DOI: 10.21437/SSW.2019-13

Cite as: Suda, H., Saito, D., Minematsu, N. (2019) Voice Conversion without Explicit Separation of Source and Filter Components Based on Non-negative Matrix Factorization. Proc. 10th ISCA Speech Synthesis Workshop, 69-74, DOI: 10.21437/SSW.2019-13.


@inproceedings{Suda2019,
  author={Hitoshi Suda and Daisuke Saito and Nobuaki Minematsu},
  title={{Voice Conversion without Explicit Separation of Source and Filter Components Based on Non-negative Matrix Factorization}},
  year=2019,
  booktitle={Proc. 10th ISCA Speech Synthesis Workshop},
  pages={69--74},
  doi={10.21437/SSW.2019-13},
  url={http://dx.doi.org/10.21437/SSW.2019-13}
}