SLTU-2008 - First International Workshop on Spoken Languages Technologies for Under-Resourced Languages

Hanoi, Vietnam
May 5-7, 2008

Which Units for Acoustic and Language Modeling for Khmer Automatic Speech Recognition?

Sopheap Seng (1,2), Sethserey Sam (1,2), Viet-Bac Le (1), Brigitte Bigi (1), Laurent Besacier (1)

(1) LIG Laboratory, UMR 5217, Grenoble, France
(2) International Research Center MICA, CNRS/UMI-2954, Hanoi, Vietnam

In this paper we present an overview of the development of a large vocabulary continuous speech recognition system for the Khmer language. We describe the methods and tools used to quickly collect the language resources needed to build an ASR system for a new under-resourced language. Faced with the lack of text data and with word segmentation errors in language modeling, we investigate how different views of the text data (word and sub-word units) can be exploited for Khmer language modeling. We propose to work both at the model level (by building hybrid vocabularies that contain both word and sub-word units) and at the ASR output level (by using a simple N-best list voting mechanism). For acoustic modeling, we use basic linguistic rules to automatically generate grapheme-based and phoneme-based pronunciation dictionaries. An experimental framework is set up to evaluate the performance of each modeling unit.

Index Terms - ASR, Khmer, word and sub-word units, acoustic modeling, language modeling.

Bibliographic reference. Seng, Sopheap / Sam, Sethserey / Le, Viet-Bac / Bigi, Brigitte / Besacier, Laurent (2008): "Which Units for Acoustic and Language Modeling for Khmer Automatic Speech Recognition?", in SLTU-2008, 33-38.