7th International Conference on Spoken Language Processing

September 16-20, 2002
Denver, Colorado, USA

A Comparison of Four Language Models for Large Vocabulary Turkish Speech Recognition

Helin Dutagaci, Levent M. Arslan

Bogazici University, Turkey

This paper compares three language models proposed as alternatives to the word-based language model for large vocabulary speech recognition of Turkish. Turkish is an agglutinative, morphologically productive language, which results in a huge vocabulary and a large number of out-of-vocabulary words in unseen test data. The solution is to parse words into smaller base units that can cover the language with a relatively small vocabulary. Three ways of decomposing words into base units are described: a morpheme-based model, a stem-ending-based model, and a syllable-based model. These models are compared with respect to vocabulary size, coverage, number of out-of-vocabulary words, perplexity, and sensitivity to context. For all three models, a significant improvement in these measures is observed compared to the word-based language model.
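To illustrate why sub-word units reduce the number of out-of-vocabulary items, the following is a minimal sketch and not the authors' method. The stem list, the longest-prefix splitting heuristic, the toy training and test words, and the function names `split_stem_ending` and `oov_rate` are all assumptions made for illustration; a real stem-ending model would rely on a proper morphological analysis of Turkish.

```python
def split_stem_ending(word, stems):
    """Split a word into [stem, ending] using the longest matching stem.

    Toy heuristic (an assumption, not the paper's parser): if no stem in
    the hypothetical stem list is a prefix of the word, keep the word whole.
    """
    for stem in sorted(stems, key=len, reverse=True):
        if word.startswith(stem):
            ending = word[len(stem):]
            return [stem, ending] if ending else [stem]
    return [word]


def oov_rate(train_units, test_units):
    """Fraction of test tokens whose unit never appears in the training data."""
    vocab = set(train_units)
    if not test_units:
        return 0.0
    missing = sum(1 for unit in test_units if unit not in vocab)
    return missing / len(test_units)


# Hypothetical toy data: inflected forms of "ev" (house) and "okul" (school).
stems = {"ev", "okul"}
train_words = ["evler", "evde", "okullar"]
test_words = ["evlerde", "okulda", "okul"]

# Word-based units: every distinct surface form is its own vocabulary entry.
word_oov = oov_rate(train_words, test_words)

# Stem-ending units: each word contributes a stem and, if present, an ending.
train_units = [u for w in train_words for u in split_stem_ending(w, stems)]
test_units = [u for w in test_words for u in split_stem_ending(w, stems)]
unit_oov = oov_rate(train_units, test_units)

print(f"word-based OOV rate:  {word_oov:.2f}")
print(f"stem-ending OOV rate: {unit_oov:.2f}")
```

In this toy setting none of the test word forms occur in the training words, whereas the stems are shared across inflected forms, so only the unseen endings count as out-of-vocabulary. The numbers say nothing about the results reported in the paper.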



Bibliographic reference. Dutagaci, Helin / Arslan, Levent M. (2002): "A comparison of four language models for large vocabulary Turkish speech recognition", In ICSLP-2002, 729-732.