8th European Conference on Speech Communication and Technology

Geneva, Switzerland
September 1-4, 2003


On Lexicon Creation for Turkish LVCSR

Kadri Hacioglu (1), Bryan Pellom (1), Tolga Ciloglu (2), Ozlem Ozturk (2), Mikko Kurimo (3), Mathias Creutz (3)

(1) University of Colorado at Boulder, USA
(2) Middle East Technical University, Turkey
(3) Helsinki University of Technology, Finland

In this paper, we address the lexicon design problem in Turkish large vocabulary continuous speech recognition (LVCSR). Although we focus only on Turkish, the methods described here are general enough to be considered for other agglutinative languages such as Finnish and Korean. In an agglutinative language, many words can be created from a single root word using a rich collection of morphological rules, so a virtually infinite lexicon is required to cover the language if words are used as the basic units. The standard approach to this problem is to discover a number of primitive units so that a large set of words can be created by compounding those units. Two broad classes of methods are available for splitting words into their sub-units: morphology-based and data-driven methods. Although word splitting significantly reduces the out-of-vocabulary rate, it shrinks the context and increases acoustic confusability. We have used two methods to address these side effects. In the first, we use word counts to avoid splitting high-frequency lexical units; in the second, we recompound splits according to a probabilistic measure. We present experimental results showing that both methods are effective in lowering the word error rate, at the expense of a larger lexicon.
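The first of these two methods, keeping high-frequency words whole and splitting only the rest, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `toy_split` and `build_lexicon`, the suffix list, the `+` marker on suffix units, and the `min_count` threshold are all hypothetical, and the toy splitter merely stands in for a real morphology-based or data-driven splitter.

```python
from collections import Counter


def toy_split(word, suffixes=("lerde", "ler", "de", "da")):
    """Hypothetical stand-in splitter: strip the first matching suffix, if any.

    A real system would use a morphological analyzer or a data-driven
    segmentation model here; this toy version is for illustration only.
    """
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix):
            # Mark the suffix with '+' so it can be re-attached after decoding.
            return [word[: -len(suffix)], "+" + suffix]
    return [word]


def build_lexicon(tokens, min_count, split=toy_split):
    """Keep words seen at least `min_count` times whole; split the others.

    `min_count` is an assumed tuning parameter, not a value from the paper.
    """
    counts = Counter(tokens)
    lexicon = set()
    for word, count in counts.items():
        if count >= min_count:
            lexicon.add(word)            # frequent word: keep as one unit
        else:
            lexicon.update(split(word))  # rare word: back off to sub-units
    return lexicon


if __name__ == "__main__":
    corpus = ["evde"] * 3 + ["evler", "okulda"]
    print(sorted(build_lexicon(corpus, min_count=2)))
```

Lowering `min_count` keeps more whole words in the lexicon, which preserves context but enlarges the lexicon; raising it pushes more words down to sub-units. The second method, recompounding splits according to a probabilistic measure, is not sketched here because the abstract does not specify the measure.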


Bibliographic reference. Hacioglu, Kadri / Pellom, Bryan / Ciloglu, Tolga / Ozturk, Ozlem / Kurimo, Mikko / Creutz, Mathias (2003): "On lexicon creation for Turkish LVCSR", In EUROSPEECH-2003, 1165-1168.