In this paper we address two related challenges in multimodal local search applications on mobile devices: first, correctly displaying business names, and second, harvesting language model training data from an inconsistently labeled corpus. We investigate the impact of common text normalization and of the quality of the language model training corpus on the accuracy of displayed results. We propose a new language model framework that eliminates the need for explicit inverse text normalization. The same framework can be applied to sift through corrupted language model training data. Our new language model is 25% more accurate while being 25% smaller.
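To make the idea of avoiding a separate inverse text normalization (ITN) step concrete, the sketch below illustrates the general technique: spoken-form tokens are rewritten into display-form tokens when preparing the language model training text, so the recognizer can emit display forms directly. This is an illustrative sketch only, not the paper's implementation; the mapping table and function names are hypothetical.

```python
# Illustrative sketch (not the paper's method): rewrite spoken-form
# tokens into display-form tokens so an LM trained on the result
# produces display text without a separate ITN post-process.

# Hypothetical spoken-form -> display-form token pairs; entries are
# examples only, not taken from the paper.
SPOKEN_TO_DISPLAY = {
    ("twenty", "four"): "24",
    ("seven",): "7",
    ("and",): "&",
}

def to_display_form(spoken_tokens):
    """Greedily rewrite spoken tokens into display-form tokens."""
    out, i = [], 0
    while i < len(spoken_tokens):
        for span in (2, 1):  # try longer matches first
            key = tuple(spoken_tokens[i:i + span])
            if key in SPOKEN_TO_DISPLAY:
                out.append(SPOKEN_TO_DISPLAY[key])
                i += span
                break
        else:
            out.append(spoken_tokens[i])  # no rewrite; keep token as-is
            i += 1
    return out

print(to_display_form("twenty four seven market".split()))
# -> ['24', '7', 'market']
```

Training the language model on such display-form text folds normalization decisions into the model itself, which is the core of the framework the abstract describes.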
Bibliographic reference. Ju, Yun-Cheng / Odell, Julian (2008): "A language-modeling approach to inverse text normalization and data cleanup for multimodal voice search applications", In INTERSPEECH-2008, 2179-2182.