Sixth International Conference on Spoken Language Processing
(ICSLP 2000)

Beijing, China
October 16-20, 2000

Compound Splitting and Lexical Unit Recombination for Improved Performance of a Speech Recognition System for German Parliamentary Speeches

Martha Larson (1,2), Daniel Willett (2,3), Joachim Köhler (1), Gerhard Rigoll (2)

(1) IMK: Institute for Media Communication, GMD German National Research Institute for Information Technology, Sankt Augustin, Germany
(2) Department of Computer Science, Faculty of Electrical Engineering, Duisburg University, Duisburg, Germany
(3) now with: NTT Communication Science Lab, Kyoto, Japan

This paper proposes a novel method that combines compound splitting with phrase recombination to optimize the composition of the speech recognition lexicon for a given domain. Data-driven compound word splitting is followed by iterative recombination of high-frequency combinations. Language model perplexity and size are the criteria used to find a balance between compound decomposition, which reduces the out-of-vocabulary (OOV) rate, and lexical unit recombination, which packs additional context into a fixed-size vocabulary. The method provides a basis for lexicon design for a large vocabulary continuous speech recognition (LVCSR) system for the domain of German parliamentary speeches, which is to serve as the foundation of a spoken document information retrieval system. The approach achieves a 35% reduction in the OOV rate without a prohibitively large sacrifice in recognition performance.
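The two stages named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the splitting procedure or the recombination criterion in detail, so the two-part dictionary split, the raw pair-count merge rule, and the names `split_compound`, `recombine`, `oov_rate`, `min_len` and `min_count` are all assumptions made for illustration. In particular, the paper selects its operating point by language model perplexity and size, which this sketch does not model.

```python
from collections import Counter


def split_compound(word, known, min_len=3):
    """Split a word into two known parts if possible (hypothetical rule).

    Returns [head, tail] for the first split point where both halves are
    in `known`, otherwise [word] unchanged.
    """
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head in known and tail in known:
            return [head, tail]
    return [word]


def recombine(tokens, n_merges, min_count=2):
    """Iteratively join the most frequent adjacent pair into one unit.

    Assumption: raw pair frequency is the merge criterion, and a merge
    needs at least `min_count` occurrences.
    """
    tokens = list(tokens)
    for _ in range(n_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < min_count:
            break
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + "_" + b)  # new multi-word lexical unit
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


def oov_rate(tokens, vocab):
    """Fraction of running tokens that are not covered by the vocabulary."""
    return sum(t not in vocab for t in tokens) / len(tokens)
```

In this toy setting, splitting "bundestag" against a lexicon containing "bundes" and "tag" yields two in-vocabulary units, which lowers the OOV rate, while each recombination step adds one longer unit to the lexicon. That is the trade-off the abstract describes.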

