Sixth International Conference on Spoken Language Processing
This paper proposes a novel combined compound-splitting and phrase-recombination method that optimizes the composition of the speech recognition lexicon for a given domain. Data-driven compound word splitting is followed by iterative recombination of high-frequency combinations. Language model perplexity and size are the criteria used to identify a balance between compound decomposition, which reduces out-of-vocabulary (OOV) words, and lexical unit recombination, which packs additional context into a fixed-size vocabulary. The method provides a basis for lexicon design for a large-vocabulary continuous speech recognition (LVCSR) system for the domain of German parliamentary speeches, which is to serve as the foundation of a spoken document information retrieval system. The approach achieves a 35% reduction in OOV without a prohibitively large sacrifice in recognition performance.
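The two stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the splitting rule or the stopping criterion, so the sketch assumes a simple two-part split against a set of known words, and it substitutes a frequency threshold and a merge budget for the perplexity and size criteria used in the paper. The names `split_compound`, `recombine`, `min_len`, `min_count`, and `max_merges` are hypothetical.

```python
from collections import Counter


def split_compound(word, known, min_len=3):
    """Split `word` into two known parts if possible (assumed rule, not the paper's)."""
    if word in known:
        return [word]
    # Try every split point that leaves at least `min_len` characters per side.
    for i in range(min_len, len(word) - min_len + 1):
        left, right = word[:i], word[i:]
        if left in known and right in known:
            return [left, right]
    return [word]


def recombine(sentences, min_count=2, max_merges=10):
    """Repeatedly join the most frequent adjacent pair of units into one unit."""
    corpus = [list(s) for s in sentences]
    merges = []
    for _ in range(max_merges):
        # Count all adjacent unit pairs in the current corpus.
        pairs = Counter()
        for sent in corpus:
            pairs.update(zip(sent, sent[1:]))
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        (a, b), count = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        if count < min_count:
            break
        merges.append((a, b))
        joined = a + "_" + b
        # Rewrite the corpus with the chosen pair replaced by the joined unit.
        new_corpus = []
        for sent in corpus:
            out, i = [], 0
            while i < len(sent):
                if i + 1 < len(sent) and sent[i] == a and sent[i + 1] == b:
                    out.append(joined)
                    i += 2
                else:
                    out.append(sent[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return corpus, merges
```

In this sketch, splitting shrinks the set of distinct units (so fewer words fall outside the vocabulary), while each merge adds one multi-word unit to the lexicon. A faithful version would evaluate language model perplexity and size after each step rather than relying on a fixed count threshold.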
Bibliographic reference. Larson, Martha / Willett, Daniel / Köhler, Joachim / Rigoll, Gerhard (2000): "Compound splitting and lexical unit recombination for improved performance of a speech recognition system for German parliamentary speeches", in ICSLP-2000, vol. 3, 945-948.