This paper describes a data-driven word decompounding algorithm and applies it to an Amharic broadcast news corpus. Word decompounding increases phonetic confusability between recognition units. To address this, the baseline algorithm is enhanced with phonetic properties and with constraints on recognition units derived from prior forced alignment experiments. Speech recognition experiments were carried out to validate the approach. Out-of-vocabulary (OOV) word rates can be reduced by 30% to 40%, and an absolute word error rate (WER) reduction of 0.4% was achieved. The algorithm is relatively language independent and requires minimal adaptation to be applied to other languages.
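To illustrate why decompounding can lower the OOV rate, the following is a minimal sketch and not the algorithm of the paper. It assumes a hypothetical, hand-written split table (`splits`) and toy romanized placeholder strings that are not taken from the corpus. The phonetic constraints described above are not modeled here.

```python
def oov_rate(tokens, vocab):
    """Fraction of running tokens that are not in the vocabulary."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t not in vocab) / len(tokens)


def decompound(tokens, splits):
    """Replace each token by its sub-units when a split is known."""
    units = []
    for t in tokens:
        units.extend(splits.get(t, [t]))
    return units


# Hypothetical split table (assumption): word -> list of sub-word units.
splits = {
    "bebet": ["be", "bet"],
    "lebet": ["le", "bet"],
    "lesew": ["le", "sew"],
    "besew": ["be", "sew"],
}

# Toy training and test word sequences (placeholders, not real data).
train_words = ["bebet", "bet", "lesew"]
test_words = ["bebet", "lebet", "besew", "sew", "wede"]

# Word-level vocabulary: every word form seen in training.
word_vocab = set(train_words)
oov_word = oov_rate(test_words, word_vocab)

# Sub-word vocabulary: every unit seen after decompounding the training data.
unit_vocab = set(decompound(train_words, splits))
oov_unit = oov_rate(decompound(test_words, splits), unit_vocab)

print(f"word-level OOV rate: {oov_word:.3f}")
print(f"sub-word OOV rate:   {oov_unit:.3f}")
```

In this toy setting, unseen word forms such as `lebet` are covered once they are split into units that did occur in training, while a genuinely new form (`wede`) remains OOV. The size of the effect here is an artifact of the toy data and says nothing about the figures reported in the abstract.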
Bibliographic reference. Pellegrini, Thomas / Lamel, Lori (2007): "Using phonetic features in unsupervised word decompounding for ASR with application to a less-represented language", In INTERSPEECH-2007, 1797-1800.