2nd Workshop on Spoken Language Technologies for Under-Resourced Languages

Universiti Sains Malaysia, Penang, Malaysia
May 3-5, 2010

Malay Language Modeling in Large Vocabulary Continuous Speech Recognition with Linguistic Information

Hong Kai Sze (1,2), Tan Tien Ping (2), Tang Enya Kong (3), Cheah Yu-N (2)

(1) Faculty of Engineering & Science, Universiti Tunku Abdul Rahman, Kuala Lumpur, Malaysia
(2) School of Computer Sciences, Universiti Sains Malaysia, Penang, Malaysia
(3) Universiti Multimedia, Cyberjaya, Malaysia

This paper reports our recent progress in developing and evaluating a Malay large vocabulary continuous speech recognition (LVCSR) system that takes linguistic information into account. The best baseline system achieves a word error rate (WER) of 15.8%. To find ways of improving accuracy further, we performed additional experiments using linguistic information such as part-of-speech and stem. We also tested the system with a language model built from a small amount of text; the results suggest that linguistic knowledge can be used to improve the accuracy of a Malay automatic speech recognition system.
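
For readers unfamiliar with the WER figure quoted above, the following is a minimal sketch of the standard word error rate definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a recognizer hypothesis, divided by the number of reference words. The function name `word_error_rate` and the Malay example sentence are illustrative assumptions; this is not the scoring tool used by the authors, which the abstract does not name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name), assuming whitespace
    tokenization; real scoring tools also normalize case and punctuation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one of five reference words is dropped.
wer = word_error_rate("saya suka makan nasi lemak", "saya suka makan nasi")
print(f"WER = {wer:.1%}")
```

Under this definition, a WER of 15.8% corresponds to roughly 16 word errors for every 100 reference words. Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference.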

Index Terms: Speech Recognition, Agglutinative Language, Language Modeling, Part-Of-Speech, Stem

Bibliographic reference. Sze, Hong Kai / Ping, Tan Tien / Kong, Tang Enya / Yu-N, Cheah (2010): "Malay language modeling in large vocabulary continuous speech recognition with linguistic information", in SLTU-2010, 56-61.