Sixth International Conference on Spoken Language Processing
(ICSLP 2000)

Beijing, China
October 16-20, 2000

Semi-Automatic Language Model Acquisition without Large Corpora

Tomoyosi Akiba, Katsunobu Itou

Electrotechnical Laboratory, AIST, MITI, Japan

In this paper, we discuss a methodology for developing language models for speech recognition, and introduce a semi-automatic acquisition method that does not require large corpora.

Statistical language models have gained a reputation for providing the best overall performance for speech recognition, and so are widely used in speech recognition systems today. The tasks to which statistical language models can be applied are, however, limited, because a large corpus is essential for building a statistical model, and collecting a new corpus is very costly in terms of time and effort. Thus, if our aim is to apply speech recognition to various tasks as required, we need a way of developing a new language model for a given task at reasonable cost.

Our new method, by contrast, is structured so that it can acquire language models from various knowledge resources. Each knowledge resource makes its own contribution to the acquired language model. For example, novice users may specify sequences of words that are and are not sentences. Experts can specify the constituents that make up a sentence, that is, what is often called grammatical knowledge. Most electronic dictionaries available today carry information about words, including part of speech, inflection patterns, semantic class, and so on. Of course, a corpus is also one such knowledge resource. In addition, we must take speech recognition systems into account, since the acquired language model is to be used by them.

To integrate information from such a range of knowledge resources, a uniform representation is essential. In section 2, a specific class of attribute grammars is introduced for this purpose.
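To make the idea of attribute-augmented rules concrete, here is a minimal, hypothetical sketch of a context-free rule whose symbols carry attribute constraints. This is not the paper's formalism from section 2; all class and attribute names are illustrative assumptions.

```python
# Hypothetical sketch: a context-free rule augmented with attribute
# constraints, one way to picture an attribute-grammar rule.
# All names here are illustrative, not taken from the paper.

from dataclasses import dataclass, field

@dataclass
class Symbol:
    name: str                                   # e.g. "NP", "V"
    attrs: dict = field(default_factory=dict)   # e.g. {"case": "nom"}

@dataclass
class Rule:
    lhs: Symbol
    rhs: list

    def matches(self, symbols):
        """Check that a sequence of symbols satisfies this rule's
        right-hand side: category names must line up, and every
        attribute the rule requires must agree."""
        if len(symbols) != len(self.rhs):
            return False
        for want, got in zip(self.rhs, symbols):
            if want.name != got.name:
                return False
            for key, value in want.attrs.items():
                if got.attrs.get(key) != value:
                    return False
        return True

# A toy rule: S -> NP[case=nom] V
rule = Rule(Symbol("S"), [Symbol("NP", {"case": "nom"}), Symbol("V")])
ok = rule.matches([Symbol("NP", {"case": "nom"}), Symbol("V")])
bad = rule.matches([Symbol("NP", {"case": "acc"}), Symbol("V")])
print(ok, bad)  # True False
```

A representation along these lines can hold grammatical knowledge from experts, word attributes from dictionaries, and constraints derived from example sentences in one common form.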

In section 3, we introduce a semi-automatic method to acquire a Japanese language model for any new task as required. The EDR electronic dictionary [1], an existing electronic dictionary of the Japanese language, and a small set of example sentences which are intended to convey the characteristics of the task, are used instead of a large corpus. Our method is also intended to utilize the knowledge of experts as much as possible.

Reference

  1. Japan Electronic Dictionary Research Institute, Ltd. EDR Electronic Dictionary Technical Guide. TR-042, 1993.



Bibliographic reference.  Akiba, Tomoyosi / Itou, Katsunobu (2000): "Semi-automatic language model acquisition without large corpora", In ICSLP-2000, vol.4, 49-52.