This paper proposes a novel method that combines compound splitting with phrase recombination to optimize the composition of the speech recognition lexicon for a given domain. Data-driven compound word splitting is followed by iterative recombination of high-frequency combinations. Language model perplexity and size are the criteria used to identify a balance between compound decomposition, which reduces the out-of-vocabulary (OOV) rate, and lexical unit recombination, which packs additional context into a fixed-size vocabulary. The method provides a basis for lexicon design for an LVCSR system for the domain of German parliamentary speeches, which is to serve as the foundation of a spoken document information retrieval system. The approach achieves a 35% reduction in the OOV rate without a prohibitively large sacrifice in recognition performance.
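The two stages named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names (`split_compound`, `merge_pair`, `recombine`), the two-part split heuristic, the minimum part length, and the frequency threshold are all assumptions made for illustration. In particular, the sketch stops recombination on a simple count threshold, whereas the paper uses language model perplexity and size as its criteria.

```python
from collections import Counter


def split_compound(word, vocab, min_len=4):
    """Split `word` into two parts if both parts occur in `vocab`.

    Hypothetical heuristic: among all valid split points, keep the one
    whose rarer part is most frequent. Linking elements are ignored.
    """
    best = None
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head in vocab and tail in vocab:
            score = min(vocab[head], vocab[tail])
            if best is None or score > best[0]:
                best = (score, [head, tail])
    return best[1] if best else [word]


def merge_pair(units, a, b):
    """Replace every adjacent occurrence of (a, b) with the joined unit a_b."""
    out = []
    i = 0
    while i < len(units):
        if i + 1 < len(units) and units[i] == a and units[i + 1] == b:
            out.append(a + "_" + b)
            i += 2
        else:
            out.append(units[i])
            i += 1
    return out


def recombine(sentences, min_count=2, max_merges=10):
    """Iteratively join the most frequent adjacent pair of lexical units.

    Stops when no pair reaches `min_count` or after `max_merges` merges
    (both are assumed stand-ins for the paper's perplexity/size criteria).
    """
    sentences = [list(s) for s in sentences]
    merges = []
    for _ in range(max_merges):
        pairs = Counter()
        for s in sentences:
            pairs.update(zip(s, s[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < min_count:
            break
        merges.append((a, b))
        sentences = [merge_pair(s, a, b) for s in sentences]
    return sentences, merges


# Toy data, invented for this example only.
vocab = Counter({"daten": 5, "bank": 4, "datenbank": 1})
parts = split_compound("datenbank", vocab)

corpus = [["sehr", "geehrte", "damen"], ["sehr", "geehrte", "herren"]]
merged, merges = recombine(corpus)
```

Splitting lets rare compounds be covered by their more frequent parts, while recombination turns frequent sequences into single lexicon entries, which is the trade-off the abstract describes.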
Cite as: Larson, M., Willett, D., Köhler, J., Rigoll, G. (2000) Compound splitting and lexical unit recombination for improved performance of a speech recognition system for German parliamentary speeches. Proc. 6th International Conference on Spoken Language Processing (ICSLP 2000), vol. 3, 945-948, doi: 10.21437/ICSLP.2000-690
@inproceedings{larson00_icslp,
  author={Martha Larson and Daniel Willett and Joachim Köhler and Gerhard Rigoll},
  title={{Compound splitting and lexical unit recombination for improved performance of a speech recognition system for German parliamentary speeches}},
  year=2000,
  booktitle={Proc. 6th International Conference on Spoken Language Processing (ICSLP 2000)},
  volume={3},
  pages={945--948},
  doi={10.21437/ICSLP.2000-690}
}