Global Syllable Vectors for Building TTS Front-End with Deep Learning

Jinfu Ni, Yoshinori Shiga, Hisashi Kawai

Recent vector space representations of words have succeeded in capturing syntactic and semantic regularities. In the context of text-to-speech (TTS) synthesis, the front-end is a key component for extracting multi-level linguistic features from text, where the syllable acts as a link between low- and high-level features. This paper describes the use of global syllable vectors as features for building a front-end, evaluated in particular on Chinese. The global syllable vectors directly capture global statistics of syllable-syllable co-occurrences in a large-scale text corpus. They are learned by a global log-bilinear regression model in an unsupervised manner, whilst the front-end is built using deep bidirectional recurrent neural networks in a supervised fashion. Experiments are conducted on large-scale Chinese speech and treebank text corpora, evaluating grapheme-to-phoneme (G2P) conversion, word segmentation, part-of-speech (POS) tagging, phrasal chunking, and pause break prediction. Results show that the proposed method is efficient for building a compact and robust front-end with high performance. Because the global syllable vectors can be acquired relatively cheaply from plain text resources, they are valuable for developing multilingual speech synthesis, especially for under-resourced languages.
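The "global log-bilinear regression model" referred to in the abstract fits vector dot products to the logarithms of global co-occurrence counts via a weighted least-squares objective, in the style of GloVe. The sketch below illustrates that loss over syllable-syllable co-occurrence counts; it is a minimal illustration, not the authors' implementation, and the weighting constants (x_max, alpha) are the conventional GloVe defaults assumed here.

```python
import numpy as np

def glove_loss(W, W_ctx, b, b_ctx, X, x_max=100.0, alpha=0.75):
    """Weighted least-squares loss over nonzero syllable co-occurrence counts.

    W, W_ctx : (V, d) arrays of syllable and context vectors.
    b, b_ctx : (V,) bias vectors.
    X        : dict mapping (i, j) -> co-occurrence count x_ij > 0.
    """
    loss = 0.0
    for (i, j), x_ij in X.items():
        # Weighting function f(x) caps the influence of very frequent pairs.
        f = (x_ij / x_max) ** alpha if x_ij < x_max else 1.0
        # Log-bilinear model: w_i . w~_j + b_i + b~_j should approximate log x_ij.
        diff = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(x_ij)
        loss += f * diff * diff
    return loss
```

Minimizing this loss (e.g. with AdaGrad over the nonzero entries of X) yields the syllable vectors; the sum W + W_ctx is commonly used as the final embedding.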

DOI: 10.21437/Interspeech.2017-669

Cite as: Ni, J., Shiga, Y., Kawai, H. (2017) Global Syllable Vectors for Building TTS Front-End with Deep Learning. Proc. Interspeech 2017, 769-773, DOI: 10.21437/Interspeech.2017-669.

@inproceedings{ni17_interspeech,
  author={Jinfu Ni and Yoshinori Shiga and Hisashi Kawai},
  title={Global Syllable Vectors for Building TTS Front-End with Deep Learning},
  booktitle={Proc. Interspeech 2017},
  year={2017},
  pages={769--773},
  doi={10.21437/Interspeech.2017-669}
}