4th International Conference on Spoken Language Processing

Philadelphia, PA, USA
October 3-6, 1996

Predicting the Out-of-Vocabulary Rate and the Required Vocabulary Size for Speech Processing Applications

Johannes Müller, Holger Stahl, Manfred Lang

Institute for Human-Machine-Communication, Munich University of Technology, Munich, Germany

This paper describes an approach for predicting both the vocabulary size and the resulting out-of-vocabulary rate (OOV-rate) for a hypothetical extension of an existing text corpus. The original corpus is split into two sub-corpora, and the vocabulary and the OOV-rate are determined for that particular split. Averages are then calculated over all combinations of sub-corpora and approximated by analytic functions. These functions allow the vocabulary size and the OOV-rate to be predicted easily. The relative prediction error is below 4.6%.
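
The splitting-and-averaging step can be illustrated with a minimal Python sketch. It rests on assumptions not stated in the abstract: the corpus is pre-divided into equal-sized chunks of tokens, one sub-corpus (the "training" part) defines the vocabulary, and the OOV-rate is the fraction of tokens in the other sub-corpus that are missing from that vocabulary. The function names `vocab_and_oov` and `average_over_splits` are hypothetical, and the fit of an analytic function to the averages is not shown because the abstract does not give its form.

```python
from itertools import combinations


def vocab_and_oov(train_tokens, test_tokens):
    """Return (vocabulary size, OOV-rate) for a single split.

    Assumption: the vocabulary is the set of distinct tokens in the
    training part, and the OOV-rate is measured on the other part.
    """
    vocab = set(train_tokens)
    if not test_tokens:
        return len(vocab), 0.0
    oov_count = sum(1 for word in test_tokens if word not in vocab)
    return len(vocab), oov_count / len(test_tokens)


def average_over_splits(chunks, k):
    """Average vocabulary size and OOV-rate over all choices of k chunks
    as the training sub-corpus; the remaining chunks form the test part."""
    sizes, rates = [], []
    for train_idx in combinations(range(len(chunks)), k):
        train = [w for i in train_idx for w in chunks[i]]
        test = [w for i in range(len(chunks))
                if i not in train_idx for w in chunks[i]]
        size, rate = vocab_and_oov(train, test)
        sizes.append(size)
        rates.append(rate)
    return sum(sizes) / len(sizes), sum(rates) / len(rates)


# Toy corpus of three chunks (illustrative data only).
toy_chunks = [["a", "b"], ["a", "c"], ["b", "d"]]
for k in (1, 2):
    print(k, average_over_splits(toy_chunks, k))
```

Evaluating the averages for increasing `k` yields one point per training size; under the approach described above, curves of this kind are what the analytic functions would approximate and extrapolate beyond the size of the existing corpus.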

