ISCA Archive SCST 1990

Adaptation performance in a large-vocabulary recognizer

Paul G. Bamberg, Mark A. Mandel

For large-vocabulary speech recognizers the complete training process is so time-consuming that a typical user is unlikely to find it worth the effort. While a user might be willing to provide five training tokens for each of 200 words, no one is willing to provide that many tokens for each word in a 25,000-word vocabulary. Phoneme-based recognizers, such as the ones developed at Dragon Systems, require less training data and need to be trained fully only once per language, not once per task. Even so, the complete training process requires expertise in acoustic phonetics that is beyond the typical user of a dictation system.

Fortunately, much of what is "learned" about a vocabulary in the process of training a phoneme-based speaker-dependent recognizer is in fact information that is nearly speaker-independent. Allophonic variation of phonemes, coarticulation effects between phonemes, and phoneme durations do not vary widely among different speakers.

When simple spectral parameters are employed, the parameters in the model for a given phoneme, even in a carefully controlled context, vary enough from one speaker to another to degrade recognition performance unacceptably. This may result both from differences in vocal tract length and other physiological parameters and from variations in vowel quality that are typical of different dialects. The goal of speaker adaptation is to correct for these differences on the basis of data drawn from a set of words that is much smaller than the total vocabulary size.

Human listeners quickly adapt to a new speaker when listening to a familiar language. Presumably this is because they have internalized rules for allophonic variation and coarticulation and can quickly extrapolate new information about vowel formants from one context to another.

Isolated-word recognizers that can independently adapt models for individual words have been available for several years. For large-vocabulary recognition, this form of adaptation is unacceptably slow, because adapting the entire vocabulary would require each word to be used at least once. A user has the right to expect that saying "educated", for example, should also help in adapting "educate", "educating", and "educates".

The DragonDictate recognizer employs standard hidden Markov model recognition techniques, but the models for 25,000 words are all based on about 2000 "phonemic segments". These phonemic segments are intended to provide a basis for rapid adaptation.
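The idea of sharing a small inventory of segments across a large vocabulary can be illustrated with a minimal sketch. It is not the DragonDictate implementation: the tiny lexicon, the phoneme-like segment labels, the scalar "mean" per segment, and the interpolation rate are all hypothetical stand-ins chosen for illustration. The sketch shows only that adapting the segments heard in one word also updates every other word built from those segments.

```python
# Hypothetical lexicon: each word is spelled as a sequence of shared segment IDs.
# Real phonemic segments would be far more numerous and context-dependent.
STEM = ["EH", "JH", "AH", "K", "EY", "T"]
LEXICON = {
    "educate": STEM,
    "educated": STEM + ["IH", "D"],
    "educating": STEM + ["IH", "NG"],
    "educates": STEM + ["S"],
    "moon": ["M", "UW", "N"],
}


def adapt_segments(means, word, observed, rate=0.5):
    """Move the mean of each segment in `word` toward its observed value.

    `means` maps segment ID -> scalar parameter (an assumed, simplified model).
    `observed` maps segment ID -> value measured from the new speaker.
    Returns the set of segment IDs that were updated.
    """
    updated = set()
    for seg in LEXICON[word]:
        if seg in observed:
            # Simple linear interpolation between old mean and new observation.
            means[seg] = (1.0 - rate) * means[seg] + rate * observed[seg]
            updated.add(seg)
    return updated


def coverage(word, adapted):
    """Fraction of a word's segments that have received adaptation data."""
    segs = LEXICON[word]
    return sum(1 for s in segs if s in adapted) / len(segs)


# Start from speaker-independent means (all zero in this toy example).
all_segments = {s for segs in LEXICON.values() for s in segs}
means = {s: 0.0 for s in all_segments}

# The user says "educated" once; every segment in it is observed as 1.0.
observed = {s: 1.0 for s in LEXICON["educated"]}
adapted = adapt_segments(means, "educated", observed)

for w in LEXICON:
    print(w, round(coverage(w, adapted), 3))
```

In this toy setting a single token of "educated" fully covers "educate" and covers most of "educating" and "educates", while "moon", which shares no segments with it, is untouched.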

Cite as: Bamberg, P.G., Mandel, M.A. (1990) Adaptation performance in a large-vocabulary recognizer. Proc. ESCA Workshop on Speaker Characterization in Speech Technology, 46-52
