Predicting Pronunciations with Syllabification and Stress with Recurrent Neural Networks

Daan van Esch, Mason Chua, Kanishka Rao


Word pronunciations, consisting of phoneme sequences and the associated syllabification and stress patterns, are vital for both speech recognition and text-to-speech (TTS) systems. For speech recognition, phoneme sequences for words may be learned from audio data. We train recurrent neural network (RNN)-based models to predict the syllabification and stress pattern for such pronunciations, making them usable for TTS. We find that these RNN models significantly outperform naive rule-based models for almost all languages we tested. Further, we find additional improvements to the stress prediction model by using the spelling as features in addition to the phoneme sequence. Finally, we train a single RNN model to predict the phoneme sequence, syllabification, and stress for a given word. For several languages, this single RNN outperforms similar models trained specifically for either phoneme sequence or stress prediction. We report an exhaustive comparison of these approaches for twenty languages.
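To make the comparison concrete, the naive rule-based baselines mentioned above can be sketched as follows. This is an illustrative example only, not code from the paper: the toy vowel inventory, the one-nucleus-per-syllable grouping rule, and the initial-stress rule are all assumptions chosen to show the kind of simple heuristic an RNN sequence model would be compared against.

```python
# Illustrative naive baseline (an assumption, not the paper's implementation):
# syllabify a phoneme sequence by starting a new syllable at each vowel
# nucleus after the first, and always stress the initial syllable.

VOWELS = {"AA", "AE", "AH", "EH", "IY", "OW", "UW"}  # toy ARPAbet-style subset


def naive_syllabify(phonemes):
    """Group phonemes into syllables: each syllable holds one vowel nucleus;
    consonants attach to the current syllable (no onset maximization)."""
    syllables = [[]]
    has_nucleus = False
    for ph in phonemes:
        if ph in VOWELS and has_nucleus:
            syllables.append([ph])  # new nucleus -> new syllable
        else:
            syllables[-1].append(ph)
        if ph in VOWELS:
            has_nucleus = True
    return syllables


def naive_stress(syllables):
    """Mark the first syllable as stressed (1), all others unstressed (0)."""
    return [1 if i == 0 else 0 for i in range(len(syllables))]
```

For example, `naive_syllabify(["HH", "AH", "L", "OW"])` yields two syllables, and `naive_stress` stresses the first. A heuristic like this ignores language-specific onset and stress rules, which is why a learned sequence model can outperform it for most languages.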


DOI: 10.21437/Interspeech.2016-1419

Cite as

Esch, D.v., Chua, M., Rao, K. (2016) Predicting Pronunciations with Syllabification and Stress with Recurrent Neural Networks. Proc. Interspeech 2016, 2841-2845.

Bibtex
@inproceedings{Esch+2016,
author={Daan van Esch and Mason Chua and Kanishka Rao},
title={Predicting Pronunciations with Syllabification and Stress with Recurrent Neural Networks},
year={2016},
booktitle={Interspeech 2016},
doi={10.21437/Interspeech.2016-1419},
url={http://dx.doi.org/10.21437/Interspeech.2016-1419},
pages={2841--2845}
}