ISCA Archive Interspeech 2008

Simultaneous conversion of duration and spectrum based on statistical models including time-sequence matching

Kaori Yutani, Yosuke Uto, Yoshihiko Nankaku, Tomoki Toda, Keiichi Tokuda

This paper describes a technique for simultaneously converting duration and spectrum, based on a statistical model that includes time-sequence matching. Conventional GMM-based approaches cannot take speaking rate into account during spectral conversion because they assume one-to-one frame matching between source and target features. However, speaker characteristics may also appear in speaking rate. To perform duration conversion, we attach duration models to a statistical model that includes time-sequence matching (DPGMM). Since DPGMM can directly represent two sequences of different lengths, spectrum and duration can be converted within an integrated framework. In the proposed technique, each mixture component of the DPGMM has its own duration transformation function, so durations are converted nonlinearly and depending on spectral information. In a subjective DMOS test, the proposed method outperformed the conventional method.


doi: 10.21437/Interspeech.2008-331

Cite as: Yutani, K., Uto, Y., Nankaku, Y., Toda, T., Tokuda, K. (2008) Simultaneous conversion of duration and spectrum based on statistical models including time-sequence matching. Proc. Interspeech 2008, 1072-1075, doi: 10.21437/Interspeech.2008-331

@inproceedings{yutani08_interspeech,
  author={Kaori Yutani and Yosuke Uto and Yoshihiko Nankaku and Tomoki Toda and Keiichi Tokuda},
  title={{Simultaneous conversion of duration and spectrum based on statistical models including time-sequence matching}},
  year=2008,
  booktitle={Proc. Interspeech 2008},
  pages={1072--1075},
  doi={10.21437/Interspeech.2008-331}
}