Bootstrapping a Text Normalization System for an Inflected Language. Numbers as a Test Case

Anna Björk Nikulásdóttir, Jón Guðnason


Text normalization is an important part of many natural language applications, in particular text-to-speech systems. It poses special challenges for highly inflected languages, since the correct morphological form of the normalized word cannot be inferred from the non-standard word itself, e.g. a digit.

In this paper we report on ongoing work on a text normalization system for Icelandic, a highly inflected North Germanic language. We describe experiments on the normalization of numbers and address the problem of choosing the correct morphological form of number names. We use language models (LMs) trained on texts containing number names and examine the effects of different LMs on domain-specific texts with a high ratio of digits. A partially class-based LM, in which number names are replaced with their part-of-speech tags, shows the best results in all domains. We further show that testing normalization on texts where number names have been converted to digits does not yield results that are representative of texts originally containing digits: while a test set similar to the language model training data shows an error rate of 10.1% on inflected cardinals from 1–99, test sets originally containing digits show error rates of 45.3% and 55%.
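The idea behind a partially class-based LM can be illustrated with a minimal sketch. This is not the system described in the paper: the tag names (`NUM_MASC`, `NUM_FEM`, `NUM_NEUT`), the three-sentence training corpus, and the count-based score are all assumptions made for illustration, and only nominative forms of 2 and 3 are covered. Number names in the training text are replaced with a class tag, so that evidence from "þrjár konur" (three women) can help choose "tvær" for "2 konur" even though "tvær konur" never occurs in the training data.

```python
from collections import Counter

# Hypothetical lexicon: number name -> (digit, class tag).
# Only nominative forms; a real tag set would also encode case.
NUMBER_FORMS = {
    "tveir": ("2", "NUM_MASC"),
    "tvær": ("2", "NUM_FEM"),
    "tvö": ("2", "NUM_NEUT"),
    "þrír": ("3", "NUM_MASC"),
    "þrjár": ("3", "NUM_FEM"),
    "þrjú": ("3", "NUM_NEUT"),
}

# Toy training corpus (assumption): number names are spelled out.
TRAINING = [
    ["þrír", "menn", "komu"],
    ["þrjár", "konur", "komu"],
    ["þrjú", "börn", "komu"],
]


def to_classes(tokens):
    """Replace number names with their class tag; keep other tokens."""
    return [NUMBER_FORMS[t][1] if t in NUMBER_FORMS else t for t in tokens]


def train_bigrams(sentences):
    """Count bigrams over the partially class-based token sequences."""
    counts = Counter()
    for sent in sentences:
        padded = ["<s>"] + to_classes(sent) + ["</s>"]
        counts.update(zip(padded, padded[1:]))
    return counts


def normalize(tokens, counts):
    """Expand each known digit to the form whose class fits its neighbours best."""
    out = []
    for i, tok in enumerate(tokens):
        candidates = [(form, tag) for form, (digit, tag) in NUMBER_FORMS.items()
                      if digit == tok]
        if not candidates:
            out.append(tok)
            continue
        prev = to_classes(out[-1:])[0] if out else "<s>"
        nxt = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
        # Simplified score: summed bigram counts to the left and right,
        # standing in for a proper LM probability.
        best = max(candidates,
                   key=lambda c: counts[(prev, c[1])] + counts[(c[1], nxt)])
        out.append(best[0])
    return out


counts = train_bigrams(TRAINING)
print(normalize(["2", "konur", "komu"], counts))  # ['tvær', 'konur', 'komu']
```

A realistic setup would use smoothed n-gram probabilities and a full part-of-speech tag set covering case as well as gender; the sketch only shows why sharing statistics across number names through a class tag can help with forms that are rare or unseen in the training text.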


DOI: 10.21437/Interspeech.2019-2367

Cite as: Nikulásdóttir, A.B., Guðnason, J. (2019) Bootstrapping a Text Normalization System for an Inflected Language. Numbers as a Test Case. Proc. Interspeech 2019, 4455-4459, DOI: 10.21437/Interspeech.2019-2367.


@inproceedings{Nikulásdóttir2019,
  author={Anna Björk Nikulásdóttir and Jón Guðnason},
  title={{Bootstrapping a Text Normalization System for an Inflected Language. Numbers as a Test Case}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={4455--4459},
  doi={10.21437/Interspeech.2019-2367},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2367}
}