2nd Workshop on Spoken Language Technologies for Under-Resourced Languages

Universiti Sains, Penang, Malaysia
May 3-5, 2010

On Morph-Based LVCSR Improvements

Balázs Tarján (1), Péter Mihajlik (1,2)

(1) Department of Telecommunication and Media Informatics, Budapest University of Technology & Economics, Hungary
(2) THINKTech Research Center Nonprofit LLC, Hungary

Efficient large vocabulary continuous speech recognition (LVCSR) of morphologically rich languages is a major challenge because of rapid vocabulary growth. To improve results, various subword units, called morphs, are used as basic language elements. The improvements over the word baseline, however, range from negative to a halving of the error rate across languages and tasks. In this paper we attempt to explore the source of this variability. Different LVCSR tasks of an agglutinative language are investigated in numerous experiments using full vocabularies. The resulting improvements are also compared with previously published results for other languages. Important correlations are found between the morph-based improvements and both the vocabulary growth and the corpus size.


Bibliographic reference. Tarján, Balázs / Mihajlik, Péter (2010): "On morph-based LVCSR improvements", in SLTU-2010, 10-16.