Learning Structured Dictionaries for Exemplar-based Voice Conversion

Shaojin Ding, Christopher Liberatore, Ricardo Gutierrez-Osuna


Incorporating phonetic information has been shown to improve the performance of exemplar-based voice conversion. A standard approach is to build a phonetically structured dictionary, where exemplars are categorized into sub-dictionaries according to their phoneme labels. However, acquiring phoneme labels can be expensive, and the labels themselves may contain inaccuracies. The latter problem becomes more salient when the speakers are non-native. This paper presents an iterative dictionary-learning algorithm that avoids the need for phoneme labels and instead learns the structured dictionaries in an unsupervised fashion. At each iteration, two steps are performed alternately: a cluster update and a dictionary update. In the cluster update step, each training frame is assigned to the cluster whose sub-dictionary represents it with the lowest residual. In the dictionary update step, the sub-dictionary for each cluster is updated using all the speech frames assigned to that cluster. We evaluate the proposed algorithm through objective and subjective experiments on a new corpus of non-native English speech. Compared to previous studies, the proposed algorithm improves the acoustic quality of voice-converted speech while retaining the target speaker’s identity.
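The alternating cluster-update / dictionary-update loop described above can be sketched as follows. This is a simplified illustration, not the paper's implementation: it models each sub-dictionary as a low-rank orthonormal basis (a K-subspaces-style stand-in for the paper's exemplar sub-dictionaries with sparse coding), and all function names, the rank, and the iteration count are assumptions made for the sketch.

```python
import numpy as np

def learn_structured_dictionary(X, n_clusters, rank=2, n_iters=20, seed=0):
    """Unsupervised structured dictionary learning (illustrative sketch).

    X : (dim, n_frames) matrix of speech feature frames.
    Alternates two steps, mirroring the abstract:
      - dictionary update: refit each cluster's sub-dictionary (here, a
        rank-`rank` SVD basis) from the frames currently assigned to it;
      - cluster update: reassign each frame to the sub-dictionary that
        reconstructs it with the lowest residual.
    """
    rng = np.random.default_rng(seed)
    dim, n = X.shape
    labels = rng.integers(0, n_clusters, size=n)  # random initial clustering
    bases = []
    for _ in range(n_iters):
        # Dictionary update step
        bases = []
        for k in range(n_clusters):
            idx = np.where(labels == k)[0]
            if idx.size == 0:  # re-seed an empty cluster with random frames
                idx = rng.integers(0, n, size=rank)
            U, _, _ = np.linalg.svd(X[:, idx], full_matrices=False)
            bases.append(U[:, :rank])
        # Cluster update step: residual of each frame under each sub-dictionary
        res = np.stack([np.linalg.norm(X - B @ (B.T @ X), axis=0)
                        for B in bases])
        new_labels = res.argmin(axis=0)
        if np.array_equal(new_labels, labels):  # converged
            break
        labels = new_labels
    return bases, labels
```

In the paper's setting the sub-dictionaries are exemplar frames and the activations come from sparse (non-negative) coding rather than an orthogonal projection; the sketch only conveys the alternating structure of the algorithm.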


DOI: 10.21437/Interspeech.2018-1295

Cite as: Ding, S., Liberatore, C., Gutierrez-Osuna, R. (2018) Learning Structured Dictionaries for Exemplar-based Voice Conversion. Proc. Interspeech 2018, 481-485, DOI: 10.21437/Interspeech.2018-1295.


@inproceedings{Ding2018,
  author={Shaojin Ding and Christopher Liberatore and Ricardo Gutierrez-Osuna},
  title={Learning Structured Dictionaries for Exemplar-based Voice Conversion},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={481--485},
  doi={10.21437/Interspeech.2018-1295},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1295}
}