One of the problems in text-to-speech (TTS) systems and speech-to-text (STT) systems is estimating the pronunciation of unknown words. In this paper, we propose a method for extracting unknown words and their pronunciations from similar sets of Japanese text data and speech data. Out-of-vocabulary words are extracted from text with a stochastic model, and pronunciation hypotheses are generated for them. These entries are then verified by running automatic speech recognition on audio data. In this work, we use news articles and broadcast TV news covering similar topics. Most extracted pairs turned out to be correct according to human judgment. We also tested the TTS front-end enhanced with these entries on other web news articles, and observed a relative improvement of 9.2% in pronunciation estimation accuracy. The proposed method can be used to realize a spoken language processing system that acquires and updates its lexicon automatically.
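The generate-then-verify idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the per-character reading table `CHAR_READINGS`, the function names, and the sample utterances are all hypothetical, and plain substring matching over already-recognized kana strings stands in for the speech recognition step, which the abstract does not detail.

```python
from collections import Counter
from itertools import product

# Hypothetical per-character reading table (illustrative only, not from the paper).
CHAR_READINGS = {
    "東": ["トウ", "ヒガシ"],
    "京": ["キョウ", "ケイ"],
}


def pronunciation_hypotheses(word, readings=CHAR_READINGS):
    """Enumerate candidate pronunciations by combining per-character readings."""
    # Characters without an entry are passed through unchanged.
    options = [readings.get(ch, [ch]) for ch in word]
    return ["".join(parts) for parts in product(*options)]


def verify(word, recognized_utterances, readings=CHAR_READINGS):
    """Return (hypothesis, count) pairs for hypotheses found in the recognized text.

    Assumption: recognizer output is available as kana strings, and substring
    counting is used as a stand-in for a real ASR-based verification step.
    """
    counts = Counter()
    for hyp in pronunciation_hypotheses(word, readings):
        counts[hyp] = sum(utt.count(hyp) for utt in recognized_utterances)
    # Keep only hypotheses that were actually observed, most frequent first.
    return [(hyp, n) for hyp, n in counts.most_common() if n > 0]


# Hypothetical recognizer output for two news utterances.
utterances = ["キョウワトウキョウデカイギガアリマシタ", "トウキョウノテンキ"]
print(verify("東京", utterances))
```

In this toy setting only the hypothesis that actually occurs in the audio transcripts survives, which mirrors how unsupported pronunciation candidates would be filtered out before being added to the lexicon.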
Bibliographic reference. Sasada, Tetsuro / Mori, Shinsuke / Kawahara, Tatsuya (2008): "Extracting word-pronunciation pairs from comparable set of text and speech", In INTERSPEECH-2008, 1821-1824.