Multilingual Grapheme-to-Phoneme Conversion with Global Character Vectors

Jinfu Ni, Yoshinori Shiga, Hisashi Kawai

Multilingual grapheme-to-phoneme (G2P) models are useful for multilingual speech synthesis because a single model can handle words from multiple languages. We propose a G2P model that combines global character vectors (GCVs) with bidirectional recurrent neural networks (BRNNs) and converts text, treated as a sequence of characters, directly to pronunciation. GCVs are distributional, real-valued representations of characters and their contextual interactions that can be learned from a large-scale text corpus in an unsupervised manner. Because GCVs can be learned flexibly from plain-text resources, the method supports both monolingual G2P (MoG2P) and multilingual G2P (MuG2P) conversion. We experiment with four languages (Japanese, Korean, Thai, and Chinese), learning language-dependent (LD) and language-independent (LI) GCVs and then building MoG2P and MuG2P models with two-hidden-layer BRNNs. Our results show that the LD- and LI-GCV-based MoG2P models perform equivalently and achieve better than 97.7% syllable accuracy, a relative improvement of 27% to 90%, depending on the language, over MeCab-based models. The MuG2P model reaches around 98% accuracy, slightly lower than MoG2P. The proposed method also shows potential for G2P conversion of non-normalized words, achieving 80% accuracy in Japanese.

DOI: 10.21437/Interspeech.2018-1626

Cite as: Ni, J., Shiga, Y., Kawai, H. (2018) Multilingual Grapheme-to-Phoneme Conversion with Global Character Vectors. Proc. Interspeech 2018, 2823-2827, DOI: 10.21437/Interspeech.2018-1626.

  author={Jinfu Ni and Yoshinori Shiga and Hisashi Kawai},
  title={Multilingual Grapheme-to-Phoneme Conversion with Global Character Vectors},
  booktitle={Proc. Interspeech 2018},