INTERSPEECH 2008
9th Annual Conference of the International Speech Communication Association

Brisbane, Australia
September 22-26, 2008

On the Impact of Alignment on Voice Conversion Performance

Elina Helander (1), Jan Schwarz (2), Jani Nurminen (3), Hanna Silen (1), Moncef Gabbouj (1)

(1) Tampere University of Technology, Finland; (2) Christian-Albrechts-Universität zu Kiel, Germany; (3) Nokia Devices R&D, Finland

Most current voice conversion systems model the joint density of source and target features with a Gaussian mixture model (GMM). An inherent requirement of this approach is that the source and target features be properly aligned for training. It is intuitively clear that alignment accuracy affects conversion quality, but this issue has not been thoroughly studied in the literature. Common alignment techniques include forced alignment with a speech recognizer and dynamic time warping (DTW). In this paper, we study the effect of alignment on voice conversion quality through extensive experiments and discuss issues that should be considered. The main outcome of the study is that alignment clearly matters, but that simple voice activity detection combined with DTW and some constraints can achieve the same quality as hand-marked labels.
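
To illustrate the kind of frame alignment the abstract refers to, the following is a minimal sketch of textbook DTW between two feature sequences. It is not the authors' implementation: the function names, the Euclidean frame distance, and the optional `band` parameter (a simple window on the index difference, standing in for the unspecified "constraints") are all assumptions made for illustration. Voice activity detection is not shown.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_align(source, target, band=None):
    """Align two sequences of feature vectors with dynamic time warping.

    Returns the minimum-cost path as a list of (source_index, target_index)
    pairs. `band` is a hypothetical constraint limiting |i - j| along the
    path; it is widened if needed so that the end point stays reachable.
    """
    n, m = len(source), len(target)
    if n == 0 or m == 0:
        return []
    if band is not None:
        band = max(band, abs(n - m))

    # cost[i][j] = cheapest accumulated cost of aligning the first i source
    # frames with the first j target frames.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        lo = 1 if band is None else max(1, i - band)
        hi = m if band is None else min(m, i + band)
        for j in range(lo, hi + 1):
            d = frame_distance(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # advance both
                                 cost[i - 1][j],      # advance source only
                                 cost[i][j - 1])      # advance target only

    # Backtrack from the end point to recover the aligned frame pairs.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path


# Toy one-dimensional "features": the target repeats its middle frame.
source = [[0.0], [1.0], [2.0]]
target = [[0.0], [1.0], [1.0], [2.0]]
pairs = dtw_align(source, target, band=1)
```

In a joint-density GMM setup, each aligned pair would typically supply one stacked source and target feature vector for training, which is why misaligned pairs can degrade the learned mapping.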


Bibliographic reference. Helander, Elina / Schwarz, Jan / Nurminen, Jani / Silen, Hanna / Gabbouj, Moncef (2008): "On the impact of alignment on voice conversion performance", in INTERSPEECH-2008, 1453-1456.