9th Annual Conference of the International Speech Communication Association

Brisbane, Australia
September 22-26, 2008

A Language-Modeling Approach to Inverse Text Normalization and Data Cleanup for Multimodal Voice Search Applications

Yun-Cheng Ju (1), Julian Odell (2)

(1) Microsoft Research, USA; (2) Microsoft Corporation, USA

In this paper we address two related challenges in multimodal local search applications on mobile devices: first, correctly displaying business names, and second, harvesting language model training data from an inconsistently labeled corpus. We investigate how common text normalization and the quality of the language model training corpus affect the accuracy of displayed results. We propose a new language model framework that eliminates the need for explicit inverse text normalization. The same framework can be applied to sift through corrupted language model training data. Our new language model is 25% more accurate while being 25% smaller.

Bibliographic reference. Ju, Yun-Cheng / Odell, Julian (2008): "A language-modeling approach to inverse text normalization and data cleanup for multimodal voice search applications", in INTERSPEECH-2008, 2179-2182.