Good language modeling relies on good predefined lexicons. For Chinese, since written text has no explicit word boundaries and the concept of a "word" is not well defined, constructing good lexicons is difficult. In this paper, we propose lexicon adaptation with reduced character error (LARCE), which learns new word tokens based on the criterion of reducing the error rate on an adaptation corpus. In this approach, a multi-character string is taken as a new "word" as long as it helps reduce the error rate, so that a minimal number of new, high-quality words is obtained. The algorithm is based on character-based consensus networks. Initial experiments on Chinese broadcast news show that LARCE not only significantly outperforms PAT-tree-based word extraction algorithms, but even outperforms manually augmented lexicons. We believe the concept is equally useful for other character-based languages.
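The selection criterion described above can be illustrated with a minimal sketch. It is not the LARCE algorithm itself: the paper computes errors from character-based consensus networks, whereas here the error measurement is a hypothetical callable, `char_error`, supplied by the caller. The function name `adapt_lexicon`, the greedy one-pass order, and the toy error values are all assumptions made for illustration only.

```python
from typing import Callable, Iterable, List, Set, Tuple


def adapt_lexicon(
    base_lexicon: Iterable[str],
    candidates: Iterable[str],
    char_error: Callable[[Set[str]], float],
) -> Tuple[List[str], float]:
    """Greedily keep each candidate string that lowers the error.

    char_error is a hypothetical stand-in for measuring the error on
    an adaptation corpus with a given lexicon; the paper derives this
    from character-based consensus networks, which is not modeled here.
    """
    lexicon = set(base_lexicon)
    best = char_error(lexicon)
    added: List[str] = []
    for cand in candidates:
        if cand in lexicon:
            continue
        trial = lexicon | {cand}
        err = char_error(trial)
        # Accept the string as a new "word" only if the error drops.
        if err < best:
            lexicon, best = trial, err
            added.append(cand)
    return added, best


# Toy error function (assumed values, not experimental data):
# each listed word changes the error by a fixed amount when present.
TOY_GAIN = {"新聞": 3.0, "台北": 2.0, "的人": -1.0}


def toy_error(lexicon: Set[str]) -> float:
    return 10.0 - sum(TOY_GAIN.get(w, 0.0) for w in lexicon)


added, final_error = adapt_lexicon(
    base_lexicon=["新", "聞", "台", "北"],
    candidates=["新聞", "的人", "台北"],
    char_error=toy_error,
)
```

In this toy run, only the candidates that lower the stand-in error are kept, which mirrors the idea that a string counts as a "word" only when it helps reduce the error rate.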
Bibliographic reference. Pan, Yi-cheng / Lee, Lin-shan (2007): "Lexicon adaptation with reduced character error (LARCE) - a new direction in Chinese language modeling", In INTERSPEECH-2007, 610-613.