8th International Conference on Spoken Language Processing

Jeju Island, Korea
October 4-8, 2004

Word N-gram Probability Estimation from a Japanese Raw Corpus

Shinsuke Mori, Daisuke Takuma

IBM Japan, Ltd., Japan

Statistical language modeling plays an important role in state-of-the-art speech recognizers. The most widely used language model (LM) is the word n-gram model, which is based on the frequencies of words and word sequences in a corpus. In various Asian languages, however, words are not delimited by whitespace, so sentences must be annotated with word boundary information before a large, statistically reliable corpus can be prepared. In this paper, we describe a method for building an LM directly from a raw corpus. In this method, sentences in the raw corpus are regarded as sentences annotated with stochastic word boundary information. In the experiments, we compared the predictive power of an LM built only from a segmented corpus with that of an LM built from the segmented corpus and a raw corpus. The results show that adding the raw corpus with our method reduced the perplexity by 42.9%.
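The idea of treating raw sentences as carrying stochastic word boundary information can be illustrated with a small sketch. The abstract does not give the estimation formulas, so the following is only one plausible reading, under stated assumptions: each gap between adjacent characters has an independent probability of being a word boundary, and the expected count of a candidate word is the probability of a boundary on both sides and of no boundary inside it. The function name `expected_unigram_counts`, the `boundary_probs` input, and the `max_len` cutoff are hypothetical names introduced for illustration, not taken from the paper.

```python
from collections import defaultdict


def expected_unigram_counts(sentence, boundary_probs, max_len=4):
    """Expected word counts for one unsegmented sentence.

    boundary_probs[i] is the assumed probability of a word boundary
    between sentence[i] and sentence[i + 1] (length len(sentence) - 1).
    Boundaries are assumed independent (an illustrative assumption).
    """
    n = len(sentence)
    if n == 0:
        return {}
    if len(boundary_probs) != n - 1:
        raise ValueError("need one boundary probability per character gap")

    # p[i] = probability of a boundary just before character i.
    # The sentence start and end are always boundaries.
    p = [1.0] + list(boundary_probs) + [1.0]

    counts = defaultdict(float)
    for start in range(n):
        no_inner_boundary = 1.0  # prob. that the candidate is not split
        for end in range(start + 1, min(n, start + max_len) + 1):
            if end - start > 1:
                no_inner_boundary *= 1.0 - p[end - 1]
            # boundary before the word, none inside, boundary after it
            counts[sentence[start:end]] += p[start] * no_inner_boundary * p[end]
    return dict(counts)


# Toy usage: every gap is a boundary with probability 0.5.
toy_counts = expected_unigram_counts("abc", [0.5, 0.5])
```

With boundary probabilities of exactly 0 or 1 this reduces to ordinary counting over a segmented sentence, which is why fractional counts from a raw corpus could in principle be pooled with counts from a hand-segmented one. Expected counts for longer n-grams would follow the same pattern over consecutive candidate words.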


Bibliographic reference. Mori, Shinsuke / Takuma, Daisuke (2004): "Word n-gram probability estimation from a Japanese raw corpus", in INTERSPEECH-2004, 1365-1368.