11th Annual Conference of the International Speech Communication Association

Makuhari, Chiba, Japan
September 26-30, 2010

Text Normalization Based on Statistical Machine Translation and Internet User Support

Tim Schlippe, Chenfei Zhu, Jan Gebhardt, Tanja Schultz

Cognitive Systems Lab, Karlsruhe Institute of Technology (KIT), Germany

In this paper, we describe and compare systems for text normalization based on statistical machine translation (SMT) methods, constructed with the support of internet users. By normalizing text displayed in a web interface, internet users provide a parallel corpus of normalized and non-normalized text. From this corpus, SMT models are generated to translate non-normalized text into normalized text. Building traditional language-specific text normalization systems requires linguistic knowledge as well as the programming skills to implement text normalization rules. Our systems can be built without profound computer knowledge, thanks to the simple, self-explanatory user interface and the automatic generation of the SMT models. Additionally, no in-house knowledge of the language to be normalized is required, thanks to the multilingual expertise of the internet community. All techniques are applied to French texts crawled with our Rapid Language Adaptation Toolkit [1] and compared using Levenshtein edit distance, BLEU score, and perplexity.

[1] Tanja Schultz and Alan Black. Rapid Language Adaptation Tools and Technologies for Multilingual Speech Processing. In: Proc. ICASSP, Las Vegas, NV, 2008.
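As an illustration of the first comparison measure named in the abstract, the following is a minimal sketch of a Levenshtein edit distance between a system output and a reference normalization. It assumes word-level tokens split on whitespace; the abstract does not say whether the authors measured at word or character level, and the function name `edit_distance` and the French example strings are hypothetical, not taken from the paper.

```python
def edit_distance(hyp, ref):
    """Levenshtein distance between two token sequences.

    Counts the minimum number of insertions, deletions, and
    substitutions needed to turn `hyp` into `ref`.
    """
    # prev[j] holds the distance between the first i-1 tokens of hyp
    # and the first j tokens of ref.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]  # turning i hyp tokens into an empty ref takes i deletions
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(
                prev[j] + 1,         # delete h
                curr[j - 1] + 1,     # insert r
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


# Hypothetical example: a non-normalized date vs. its normalized form.
raw = "le 26 septembre".split()
normalized = "le vingt six septembre".split()
print(edit_distance(raw, normalized))
```

A lower distance between a system's output and the human-normalized reference indicates a closer match. Dividing by the reference length gives a length-independent error rate, which is a common variant but again an assumption here rather than the paper's stated procedure.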


Bibliographic reference. Schlippe, Tim / Zhu, Chenfei / Gebhardt, Jan / Schultz, Tanja (2010): "Text normalization based on statistical machine translation and internet user support", in INTERSPEECH-2010, 1816-1819.