8th International Conference on Spoken Language Processing

Jeju Island, Korea
October 4-8, 2004

Large Vocabulary Continuous Speech Recognition for Estonian Using Morpheme Classes

Tanel Alumae

Tallinn Technical University, Estonia

This paper describes the development of a large-vocabulary, continuous, speaker-independent speech recognition system for Estonian. Estonian is an agglutinative language with a very large number of distinct word forms; in addition, its word order is relatively unconstrained. To achieve good language coverage, we use pseudo-morphemes as the basic units in a statistical trigram language model. To improve language model robustness, we automatically find morpheme classes and interpolate the morpheme model with the class-based model. The language model is trained on a newspaper corpus of 15 million word forms. Clustered triphones with multiple Gaussian mixture components are used for acoustic modeling. The system with the interpolated morpheme language model is found to perform significantly better than the baseline word-form trigram system in all areas. The word error rate of the best system is 27.3%, which is a 10.0% absolute improvement over the baseline system.
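
The interpolation step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the interpolation formula, so linear interpolation with a single weight is assumed here, and the class-based estimate is assumed to factor as P(class | class history) * P(morpheme | class). The function names, the weight `lam`, and all probability values are hypothetical and chosen only for illustration.

```python
def class_based_prob(p_class_given_history: float, p_morph_given_class: float) -> float:
    """Assumed class-based estimate: P(class | class history) * P(morpheme | class)."""
    return p_class_given_history * p_morph_given_class


def interpolate(p_morph: float, p_class: float, lam: float) -> float:
    """Linearly interpolate two probability estimates with weight lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * p_morph + (1.0 - lam) * p_class


# Hypothetical values: the morpheme trigram was never seen in training,
# but its class sequence was, so the class model still assigns probability mass.
p_morph = 0.0
p_class = class_based_prob(0.2, 0.05)
p = interpolate(p_morph, p_class, lam=0.7)
print(p)  # small but non-zero, roughly 0.003
```

In this sketch the class-based term keeps the combined estimate above zero for a morpheme sequence that the morpheme model alone has not observed, which is one way such a combination can add robustness.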

Bibliographic reference. Alumae, Tanel (2004): "Large vocabulary continuous speech recognition for Estonian using morpheme classes", In INTERSPEECH-2004, 389-392.