ISCA Archive Interspeech 2008

On the impact of alignment on voice conversion performance

Elina Helander, Jan Schwarz, Jani Nurminen, Hanna Silen, Moncef Gabbouj

Most current voice conversion systems model the joint density of source and target features using a Gaussian mixture model. An inherent property of this approach is that the source and target features must be properly aligned for training. It is intuitively clear that alignment accuracy affects conversion quality, but this issue has not been thoroughly studied in the literature. Examples of alignment techniques include forced alignment with a speech recognizer and dynamic time warping (DTW). In this paper, we study the effect of alignment on voice conversion quality through extensive experiments and discuss issues that should be considered. The main outcome of the study is that alignment clearly matters, but with simple voice activity detection, DTW, and some constraints we can achieve the same quality as with hand-marked labels.
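
To make the DTW alignment mentioned in the abstract concrete, the following is a minimal illustrative sketch of textbook dynamic time warping between two feature sequences. It is not the authors' implementation: the function name `dtw_align`, the Euclidean frame distance, and the unconstrained step pattern are all assumptions made for illustration, and the voice activity detection and path constraints discussed in the paper are not modeled here.

```python
import math


def dtw_align(source, target):
    """Align two sequences of feature vectors with basic DTW.

    Hypothetical helper for illustration only. Returns (total_cost, path),
    where path is a list of (source_index, target_index) frame pairs.
    """
    n, m = len(source), len(target)

    def dist(a, b):
        # Euclidean distance between two frames (an assumed local cost).
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    inf = float("inf")
    # cost[i][j] = cheapest cumulative cost aligning the first i source
    # frames with the first j target frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(source[i - 1], target[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j - 1],  # advance both sequences
                cost[i - 1][j],      # advance source only
                cost[i][j - 1],      # advance target only
            )

    # Backtrack from the end to recover the frame-level alignment path.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path
```

Each pair in the returned path names one source frame and one target frame, which is the kind of paired data that joint-density training needs. A practical system would typically restrict the search region or step pattern, but the specific constraints used in the paper are not given in the abstract.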

doi: 10.21437/Interspeech.2008-419

Cite as: Helander, E., Schwarz, J., Nurminen, J., Silen, H., Gabbouj, M. (2008) On the impact of alignment on voice conversion performance. Proc. Interspeech 2008, 1453-1456, doi: 10.21437/Interspeech.2008-419

  author={Elina Helander and Jan Schwarz and Jani Nurminen and Hanna Silen and Moncef Gabbouj},
  title={{On the impact of alignment on voice conversion performance}},
  booktitle={Proc. Interspeech 2008},