Duration Modeling with Global Phoneme-Duration Vectors

Jinfu Ni, Yoshinori Shiga, Hisashi Kawai

A duration model is a major component of every parametric speech synthesis system. Conventional methods predict phoneme durations from full contextual labels, which require morphological analysis of the text. By contrast, advances in bidirectional recurrent neural networks (BRNNs) and global space vector models make it possible to perform grapheme-to-phoneme (G2P) conversion from plain text. In this paper, we investigate duration prediction from plain phonemes instead of from their full contextual labels. We propose a new approach that relies on both BRNNs and global space vector representations of phonemes (GPVs) and durations (GDVs). GPVs represent the statistics of phonemes used in a language, whereas GDVs capture duration variations beyond linguistic features. Both are essentially learned in an unsupervised manner from a large-scale text corpus, with the text converted to phonemes by G2P.

We conducted experiments on two speech corpora in Korean and Chinese to train BRNN-based models in a supervised manner. An objective evaluation conducted on a set of test sentences demonstrated that the proposed method leads to more accurate modeling of phoneme durations than the baselines.
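
As a rough illustration of the bidirectional idea only, the following sketch runs a forward and a backward recurrence over a sequence of phoneme vectors and maps each pair of hidden states to a positive duration. It is not the model evaluated in the paper: the tanh recurrence, the softplus output, the dimensions, the random weights, and all function and variable names are assumptions made for this example, and the random input vectors merely stand in for learned GPVs. GDVs and training are not modeled here.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Random weight matrix; illustrative initialisation, not trained values."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rnn_pass(inputs, w_in, w_rec):
    """Simple tanh recurrence; returns one hidden state per input step."""
    hidden = [0.0] * len(w_rec)
    states = []
    for x in inputs:
        pre = [a + b for a, b in zip(matvec(w_in, x), matvec(w_rec, hidden))]
        hidden = [math.tanh(p) for p in pre]
        states.append(hidden)
    return states


def predict_durations(phoneme_vectors, params):
    """Predict one positive value per phoneme from left and right context."""
    forward = rnn_pass(phoneme_vectors, params["w_in_f"], params["w_rec_f"])
    # Run the second recurrence right to left, then restore the original order.
    backward = rnn_pass(phoneme_vectors[::-1],
                        params["w_in_b"], params["w_rec_b"])[::-1]
    durations = []
    for h_f, h_b in zip(forward, backward):
        joint = h_f + h_b  # concatenate the two hidden states
        score = sum(w * h for w, h in zip(params["w_out"], joint))
        score += params["b_out"]
        # Softplus keeps the output positive (an assumption of this sketch).
        durations.append(math.log1p(math.exp(score)))
    return durations


rng = random.Random(0)
EMBED_DIM, HIDDEN_DIM, NUM_PHONEMES = 4, 3, 5  # hypothetical sizes

params = {
    "w_in_f": make_matrix(HIDDEN_DIM, EMBED_DIM, rng),
    "w_rec_f": make_matrix(HIDDEN_DIM, HIDDEN_DIM, rng),
    "w_in_b": make_matrix(HIDDEN_DIM, EMBED_DIM, rng),
    "w_rec_b": make_matrix(HIDDEN_DIM, HIDDEN_DIM, rng),
    "w_out": [rng.uniform(-0.5, 0.5) for _ in range(2 * HIDDEN_DIM)],
    "b_out": 0.0,
}

# Random stand-ins for per-phoneme vectors of one utterance.
phoneme_vectors = [[rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
                   for _ in range(NUM_PHONEMES)]

durations = predict_durations(phoneme_vectors, params)
print(durations)
```

Because the backward recurrence starts from the end of the utterance, the value predicted for the first phoneme also depends on the last one, which is the property that distinguishes a bidirectional network from a unidirectional one.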

DOI: 10.21437/Interspeech.2019-2126

Cite as: Ni, J., Shiga, Y., Kawai, H. (2019) Duration Modeling with Global Phoneme-Duration Vectors. Proc. Interspeech 2019, 4465-4469, DOI: 10.21437/Interspeech.2019-2126.
