8th European Conference on Speech Communication and Technology

Geneva, Switzerland
September 1-4, 2003


A Corpus-Based Decompounding Algorithm for German Lexical Modeling in LVCSR

Martine Adda-Decker


This paper describes a corpus-based decompounding algorithm and applies it to German large vocabulary continuous speech recognition (LVCSR). The algorithm helps address two major problems for LVCSR: lexical coverage and letter-to-sound conversion. The idea is simple: given a word start of length k, only a few different characters can continue an admissible word in the language. In a compound, however, once the word start of length k reaches a constituent word boundary, the set of successor characters can in theory include any character. The algorithm has been applied to a 300M-word corpus with 2.6M distinct words, from which 800k decomposition rules were extracted automatically. Relative OOV (out-of-vocabulary) word reductions of 25% to 50% have been achieved using word lists from 65k to 600k words. Pronunciation dictionaries have been developed for the LIMSI 300k German recognition system. As no language-specific knowledge is required beyond the text corpus, the algorithm can be applied more generally to any compounding language.
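
The successor-character idea in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the function names, the `min_variety` threshold, the `min_len` constraint, and the toy vocabulary are all assumptions made for illustration. The sketch counts how many distinct characters follow each word start in a word list, then flags positions where that count is high as candidate constituent boundaries.

```python
from collections import defaultdict

def successor_counts(words):
    """Map each word start (prefix) to the set of characters following it."""
    successors = defaultdict(set)
    for word in words:
        for k in range(1, len(word)):
            successors[word[:k]].add(word[k])
    return successors

def candidate_boundaries(word, successors, min_variety=3, min_len=3):
    """Return positions k where word[:k] has at least min_variety successors.

    min_variety and min_len are hypothetical tuning parameters, not values
    taken from the paper.
    """
    return [
        k for k in range(min_len, len(word) - min_len + 1)
        if len(successors.get(word[:k], ())) >= min_variety
    ]

# Toy vocabulary, invented for illustration (not drawn from the paper's corpus).
vocab = ["haus", "hausboot", "hausfrau", "hausmann", "haustier", "boot"]
table = successor_counts(vocab)
splits = candidate_boundaries("hausboot", table)
parts = [("hausboot"[:k], "hausboot"[k:]) for k in splits]
print(splits, parts)
```

On this toy list the word start "haus" is followed by four different characters, while "hau" and "hausb" are each followed by only one, so position 4 is the only candidate boundary for "hausboot". A real corpus would need frequency information and further filtering to turn such candidates into decomposition rules.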


Bibliographic reference.  Adda-Decker, Martine (2003): "A corpus-based decompounding algorithm for German lexical modeling in LVCSR", In EUROSPEECH-2003, 257-260.