Improving Automatically Induced Lexicons for Highly Agglutinating Languages Using Data-Driven Morphological Segmentation

Wiehan Agenbag, Thomas Niesler


We present a method for improving the performance of automatically induced lexicons for highly agglutinating languages. Our previous work demonstrated the feasibility of using automatic sub-word unit discovery and lexicon induction to enable automatic speech recognition (ASR) for under-resourced languages. However, agglutinating languages are a particularly challenging case for such approaches because they have large vocabularies of infrequently used words. In this study, we address the unfavorable vocabulary distribution of such languages by performing data-driven morphological segmentation of the orthography prior to lexicon induction. We apply this novel step to a corpus of recorded radio broadcasts in Luganda, which is a highly agglutinating and severely under-resourced language. The intervention leads to a 10% (relative) reduction in word error rate (WER), which puts the resulting ASR performance on par with an expert lexicon. When context is added to the morphological segments prior to lexicon induction, a further 1% WER reduction is achieved. This demonstrates that it is feasible to perform ASR in an under-resourced setting using an automatically induced lexicon even in the case of a highly agglutinating language.


DOI: 10.21437/Interspeech.2019-2164

Cite as: Agenbag, W., Niesler, T. (2019) Improving Automatically Induced Lexicons for Highly Agglutinating Languages Using Data-Driven Morphological Segmentation. Proc. Interspeech 2019, 3515-3519, DOI: 10.21437/Interspeech.2019-2164.


@inproceedings{Agenbag2019,
  author={Wiehan Agenbag and Thomas Niesler},
  title={{Improving Automatically Induced Lexicons for Highly Agglutinating Languages Using Data-Driven Morphological Segmentation}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={3515--3519},
  doi={10.21437/Interspeech.2019-2164},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2164}
}