Multitask Sequence-to-Sequence Models for Grapheme-to-Phoneme Conversion

Benjamin Milde, Christoph Schmidt, Joachim Köhler


Recently, neural sequence-to-sequence (Seq2Seq) models have been applied to the problem of grapheme-to-phoneme (G2P) conversion. These models offer a straightforward way of modeling the conversion by jointly learning the alignment and translation of input to output tokens in an end-to-end fashion. However, until now this approach has not shown improved error rates on its own compared to traditional joint-sequence-based n-gram models for G2P. In this paper, we investigate how multitask learning can improve the performance of Seq2Seq G2P models. A single Seq2Seq model is trained on multiple phoneme lexicon datasets covering multiple languages and phonetic alphabets. Although multi-language learning does not show improved error rates, combining standard datasets and crawled data with different phonetic alphabets of the same language yields promising error reductions for English and German Seq2Seq G2P conversion. Finally, combining Seq2Seq G2P models with standard n-gram-based models yields significant improvements over using either model alone.
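The abstract mentions combining Seq2Seq and n-gram G2P models, but does not specify the combination scheme. As a loose illustration only, a simple linear interpolation of per-candidate log-probabilities from the two models might be sketched as follows; all function names, pronunciations, and scores below are made up for the example:

```python
import math

def combine_scores(seq2seq_logprobs, ngram_logprobs, lam=0.5):
    """Hypothetical combination: linearly interpolate log-probabilities of
    candidate pronunciations from a Seq2Seq model and an n-gram
    joint-sequence model, and return candidates ranked by combined score."""
    candidates = set(seq2seq_logprobs) | set(ngram_logprobs)
    combined = {}
    for pron in candidates:
        s = seq2seq_logprobs.get(pron, -math.inf)   # missing candidates get -inf
        n = ngram_logprobs.get(pron, -math.inf)
        combined[pron] = lam * s + (1.0 - lam) * n
    return sorted(combined, key=combined.get, reverse=True)

# Illustrative candidate lists for the word "tomato" (scores invented):
seq2seq = {"T AH M EY T OW": -0.4, "T AH M AA T OW": -1.2}
ngram = {"T AH M EY T OW": -0.6, "T OW M AA T OW": -2.0}
print(combine_scores(seq2seq, ngram)[0])  # prints "T AH M EY T OW"
```

With `-inf` as the backoff score, only pronunciations hypothesized by both models can win; a real system would likely use a finite floor or a weighted n-best union instead.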


DOI: 10.21437/Interspeech.2017-1436

Cite as: Milde, B., Schmidt, C., Köhler, J. (2017) Multitask Sequence-to-Sequence Models for Grapheme-to-Phoneme Conversion. Proc. Interspeech 2017, 2536-2540, DOI: 10.21437/Interspeech.2017-1436.


@inproceedings{Milde2017,
  author={Benjamin Milde and Christoph Schmidt and Joachim Köhler},
  title={Multitask Sequence-to-Sequence Models for Grapheme-to-Phoneme Conversion},
  year={2017},
  booktitle={Proc. Interspeech 2017},
  pages={2536--2540},
  doi={10.21437/Interspeech.2017-1436},
  url={http://dx.doi.org/10.21437/Interspeech.2017-1436}
}