Most speech recognition systems rely on pronunciation dictionaries to provide accurate transcriptions. Typically, some pronunciations are crafted manually, but many are produced with pronunciation learning algorithms. Successful algorithms must be able to generate rich pronunciation variants, e.g. to accommodate words of foreign origin, while remaining robust to artifacts of the training data, e.g. noise in the acoustic segments from which pronunciations are learned, when the method uses acoustic signals. We propose a general finite-state transducer (FST) framework for describing such algorithms. This representation is flexible enough to accommodate a wide variety of pronunciation learning algorithms, including approaches that rely on the availability of acoustic data and methods that rely only on the spelling of the target words. In particular, we show that the pronunciation FST can be built from a recurrent neural network (RNN) and tuned to provide rich yet constrained pronunciations. This new approach reduces the number of incorrect pronunciations learned from Google Voice traffic by up to 25% relative.
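As a rough illustration of the core idea, and not the authors' implementation: a pronunciation FST of this kind can be viewed as a prefix-shared union of weighted grapheme-to-phoneme hypotheses, pruned to a cost beam so that the learned variants stay rich yet constrained. The plain-Python sketch below assumes hypothetical RNN beam-search outputs (phoneme sequences with negative log-probability costs) and a hypothetical beam width.

    import math
    from collections import defaultdict

    # Hypothetical RNN G2P beam-search output for the word "data":
    # each hypothesis is (phoneme sequence, negative log-probability cost).
    hypotheses = [
        (("d", "ey", "t", "ah"), 0.4),
        (("d", "ae", "t", "ah"), 1.1),
        (("d", "aa", "t", "ah"), 2.3),
        (("d", "ey", "t", "aa"), 4.0),
    ]

    def build_pronunciation_fst(hyps, beam=2.0):
        """Merge beam hypotheses into a prefix-shared weighted FST (a trie),
        keeping only variants whose cost is within `beam` of the best one."""
        best = min(cost for _, cost in hyps)
        arcs = defaultdict(dict)   # state -> {phoneme: next state}
        finals = {}                # final state -> path cost
        next_state = 1             # state 0 is the start state
        for phones, cost in hyps:
            if cost - best > beam:  # constrain: drop unlikely variants
                continue
            state = 0
            for p in phones:
                if p not in arcs[state]:
                    arcs[state][p] = next_state
                    next_state += 1
                state = arcs[state][p]
            finals[state] = min(cost, finals.get(state, math.inf))
        return arcs, finals

    # With beam=2.0, the fourth hypothesis (cost 4.0) is pruned, leaving
    # three plausible pronunciations sharing the common "d ... t ah" arcs.
    arcs, finals = build_pronunciation_fst(hypotheses)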
Cite as: Bruguier, A., Gnanapragasam, D., Johnson, L., Rao, K., Beaufays, F. (2017) Pronunciation Learning with RNN-Transducers. Proc. Interspeech 2017, 2556-2560, doi: 10.21437/Interspeech.2017-47
@inproceedings{bruguier17_interspeech,
  author={Antoine Bruguier and Danushen Gnanapragasam and Leif Johnson and Kanishka Rao and Françoise Beaufays},
  title={{Pronunciation Learning with RNN-Transducers}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={2556--2560},
  doi={10.21437/Interspeech.2017-47}
}