We propose a system that predicts baseform-generation errors in a text-to-speech (TTS) front-end and aids in customizing the synthesis engine to a novel application with a large, open-ended vocabulary. We motivate the system with data collected during the deployment of the IBM TTS engine in the Watson Deep Question-Answering system, customized to play a game of Jeopardy! We propose a set of features derived from a lexeme's orthography and candidate baseform, and use a variety of learning schemes and data sampling algorithms to address the skewed class priors in the training data. We show that (1) these different approaches provide complementary information that fusion schemes can exploit to improve on baseline performance, and (2) these techniques can retrieve a list of likely incorrect lexemes, reducing the number of tokens that must be vetted before an error is found and fixed.
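The abstract does not say which sampling algorithms or fusion schemes were used, so the following is only a minimal sketch of the general idea under stated assumptions: random oversampling of the minority class stands in for "data sampling", averaging of per-classifier scores stands in for "fusion", and lexemes are then ranked by fused error score so the most suspicious ones are vetted first. All function names and the toy scores are hypothetical and are not taken from the paper.

```python
import random


def oversample_minority(examples, labels, seed=0):
    """Randomly duplicate minority-class examples until the two classes are balanced.

    Assumption: binary labels, 1 = incorrect baseform, 0 = correct baseform.
    This would be applied to training data only.
    """
    rng = random.Random(seed)
    pos = [x for x, y in zip(examples, labels) if y == 1]
    neg = [x for x, y in zip(examples, labels) if y == 0]
    if len(pos) < len(neg):
        minority, minority_label, deficit = pos, 1, len(neg) - len(pos)
    else:
        minority, minority_label, deficit = neg, 0, len(pos) - len(neg)
    if not minority:
        # Nothing to duplicate; return the data unchanged.
        return list(examples), list(labels)
    extra = [rng.choice(minority) for _ in range(deficit)]
    return list(examples) + extra, list(labels) + [minority_label] * len(extra)


def fuse_scores(score_lists):
    """Average the per-lexeme error scores of several classifiers (simple late fusion)."""
    return [sum(column) / len(column) for column in zip(*score_lists)]


def rank_by_error_score(lexemes, scores):
    """Return lexemes sorted from most to least likely to have an incorrect baseform."""
    ranked = sorted(zip(lexemes, scores), key=lambda pair: pair[1], reverse=True)
    return [lexeme for lexeme, _ in ranked]


# Toy illustration: the scores below are made up, not results from the paper.
lexemes = ["dengue", "fever", "Jeopardy"]
scores_a = [0.9, 0.1, 0.4]  # hypothetical classifier A
scores_b = [0.7, 0.2, 0.6]  # hypothetical classifier B
vetting_order = rank_by_error_score(lexemes, fuse_scores([scores_a, scores_b]))
```

A reviewer would then work through `vetting_order` from the top, which is the sense in which a ranked list can reduce the number of tokens inspected before an error is found.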
Bibliographic reference. Rosenberg, Andrew / Fernandez, Raul / Ramabhadran, Bhuvana (2011): "“what is… dengue fever?” - modeling and predicting pronunciation errors in a text-to-speech system", In INTERSPEECH-2011, 2189-2192.