8th International Conference on Spoken Language Processing

Jeju Island, Korea
October 4-8, 2004

N-gram Language Modeling of Japanese Using Bunsetsu Boundaries

Sungyup Chung, Keikichi Hirose, Nobuaki Minematsu

University of Tokyo, Japan

A new N-gram language modeling scheme was proposed for Japanese, in which word N-grams are calculated separately for two cases: transitions that cross a bunsetsu boundary and transitions that do not. A bunsetsu is a basic grammatical (and pronunciation) unit of Japanese. The authors previously proposed a similar scheme using accent phrase boundaries instead of bunsetsu boundaries, with some success, but it suffered from a shortage of training data, because assigning accent phrase boundaries requires a speech corpus. In contrast, bunsetsu boundaries can be detected automatically from written text with rather high accuracy using parsers. Experiments showed that perplexity could be reduced and the word recognition rate improved, especially with a small training corpus, by estimating bunsetsu boundaries from a history longer than the N-1 words used in N-gram modeling and by selecting one of the two model types (crossing or not crossing a bunsetsu boundary) according to that estimate.
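As a rough illustration of keeping separate N-gram statistics for the two cases, the following minimal Python sketch trains a toy bigram model on text already segmented into bunsetsu. It is not the authors' implementation. The function names `train` and `prob`, the add-alpha smoothing, the romanized toy sentences, and the restriction to bigrams are all assumptions made for this example. The boundary estimate is simply passed in as a flag, so the estimation from a longer history described in the abstract is not modeled here.

```python
from collections import Counter

def train(sentences):
    """Count bigrams separately for boundary-crossing and non-crossing transitions.

    sentences: list of sentences; each sentence is a list of bunsetsu,
    and each bunsetsu is a list of words (hypothetical input format).
    """
    pair_counts = Counter()     # (crossing, previous word, word) -> count
    context_counts = Counter()  # (crossing, previous word) -> count
    vocab = set()
    for sentence in sentences:
        prev = "<s>"  # sentence-start symbol (assumption)
        for bunsetsu in sentence:
            for i, word in enumerate(bunsetsu):
                # The first word of a bunsetsu is reached by crossing a boundary.
                crossing = (i == 0)
                pair_counts[(crossing, prev, word)] += 1
                context_counts[(crossing, prev)] += 1
                vocab.add(word)
                prev = word
    return pair_counts, context_counts, vocab

def prob(word, prev, crossing, model, alpha=1.0):
    """Add-alpha smoothed bigram probability from the model chosen by `crossing`.

    The smoothing method is an assumption of this sketch, not taken from the paper.
    """
    pair_counts, context_counts, vocab = model
    numerator = pair_counts[(crossing, prev, word)] + alpha
    denominator = context_counts[(crossing, prev)] + alpha * len(vocab)
    return numerator / denominator

# Toy corpus in romanized Japanese, segmented into bunsetsu by hand.
sentences = [
    [["watashi", "wa"], ["hon", "o"], ["yomu"]],
    [["watashi", "wa"], ["eiga", "o"], ["miru"]],
]
model = train(sentences)

# "wa" follows "watashi" inside a bunsetsu, so the non-crossing model favors it.
p_inside = prob("wa", "watashi", False, model)
p_across = prob("wa", "watashi", True, model)
```

In a recognizer, the `crossing` flag would come from a separate boundary estimator, and the model selected by that flag would supply the word probability.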


Bibliographic reference.  Chung, Sungyup / Hirose, Keikichi / Minematsu, Nobuaki (2004): "N-gram language modeling of Japanese using bunsetsu boundaries", In INTERSPEECH-2004, 993-996.