ISCA Archive SLTU 2014

A bilingual study on the prediction of morph-based improvement

Balázs Tarján, Tibor Fegyó, Péter Mihajlik

Morph-based language modeling has been applied effectively to improve the accuracy of Large-Vocabulary Continuous Speech Recognition (LVCSR) systems, especially in morphologically rich languages. However, the size of the improvement varies greatly, and the underlying principles have been studied only superficially. A method that can predict the expected improvement before running experiments would be very useful. In this paper, we introduce language-independent factors affecting morph-based improvement and show how they can be used to estimate the effectiveness of statistical morph-based language modeling. The task was broadcast news transcription in two less-investigated languages, Hungarian and Romanian. It was found that in under-resourced conditions morph-based models can bring significant improvement, even for a morphologically less rich language like Romanian. In addition, it was shown that non-initial morph tagging can consistently outperform explicit modeling of word boundaries in terms of both letter and word accuracy.


Cite as: Tarján, B., Fegyó, T., Mihajlik, P. (2014) A bilingual study on the prediction of morph-based improvement. Proc. 4th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2014), 131-138

@inproceedings{tarjan14_sltu,
  author={Balázs Tarján and Tibor Fegyó and Péter Mihajlik},
  title={{A bilingual study on the prediction of morph-based improvement}},
  year=2014,
  booktitle={Proc. 4th Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU 2014)},
  pages={131--138}
}