This paper describes a technique for simultaneously converting duration and spectrum, based on a statistical model that includes time-sequence matching. Conventional GMM-based approaches cannot take speaking rate into account during spectral conversion because they assume one-to-one frame matching between source and target features. However, speaker characteristics may also appear in speaking rate. To perform duration conversion, we attach duration models to a statistical model that includes time-sequence matching (DPGMM). Since DPGMM can directly represent two sequences of different lengths, spectrum and duration can be converted within an integrated framework. In the proposed technique, each mixture component of DPGMM has its own duration transformation function, so durations are converted nonlinearly and in a way that depends on spectral information. In a subjective DMOS test, the proposed method was superior to the conventional method.
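As a rough illustration of why component-specific duration functions yield a nonlinear, spectrum-dependent conversion, the following sketch weights per-component duration transforms by the posterior probability of each mixture component given a spectral feature. It is not the paper's formulation: the one-dimensional spectral feature, the affine form of each transform, and all names and parameter values (`convert_duration`, `SCALES`, `OFFSETS`, and so on) are assumptions chosen for illustration only.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def component_posteriors(x, weights, means, variances):
    """Posterior probability of each mixture component given feature x."""
    likelihoods = [w * gaussian_pdf(x, m, v)
                   for w, m, v in zip(weights, means, variances)]
    total = sum(likelihoods)
    return [l / total for l in likelihoods]


def convert_duration(src_duration, spectral_feature,
                     weights, means, variances, scales, offsets):
    """Posterior-weighted sum of per-component affine duration transforms.

    Assumption: each component m maps a source duration d to
    scales[m] * d + offsets[m]. The abstract does not specify this form.
    """
    post = component_posteriors(spectral_feature, weights, means, variances)
    return sum(p * (a * src_duration + b)
               for p, a, b in zip(post, scales, offsets))


# Hypothetical two-component model: component 0 shortens, component 1 lengthens.
WEIGHTS = [0.5, 0.5]
MEANS = [-1.0, 1.0]
VARIANCES = [0.5, 0.5]
SCALES = [0.8, 1.5]
OFFSETS = [0.0, 0.0]

params = (WEIGHTS, MEANS, VARIANCES, SCALES, OFFSETS)
near_first = convert_duration(10.0, -1.0, *params)   # dominated by component 0
midpoint = convert_duration(10.0, 0.0, *params)      # equal posteriors
near_second = convert_duration(10.0, 1.0, *params)   # dominated by component 1
print(near_first, midpoint, near_second)
```

The same source duration maps to different target durations depending on where the spectral feature falls, and because the posteriors vary smoothly with that feature, the overall mapping is nonlinear even though each component's transform is affine.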
Bibliographic reference. Yutani, Kaori / Uto, Yosuke / Nankaku, Yoshihiko / Toda, Tomoki / Tokuda, Keiichi (2008): "Simultaneous conversion of duration and spectrum based on statistical models including time-sequence matching", In INTERSPEECH-2008, 1072-1075.