The nature of the training process for a speech-recognition system changes radically once the size of the vocabulary becomes larger than the number of words for which a user is willing to provide training tokens. Below this threshold, it is reasonable to make an independent model for each word in the vocabulary. Such a model, based on data from that word and no others, can in principle capture all the acoustic-phonetic subtleties of the word, even though the phonetic spelling of the word is not even used in constructing the model.
For continuous speech recognition, the quantity of data required for complete training grows much more rapidly than vocabulary. In the simple case of a recognizer for three-digit strings, for example, each digit should at a minimum be trained in initial, medial, and final position, while for optimum performance all digit triples should be included in the training data.
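The arithmetic behind this claim is easy to make concrete. A minimal sketch (the counts are illustrative, not from the paper): positional coverage grows linearly with vocabulary size, while triple coverage grows with its cube.

```python
# Illustration of how training-data requirements grow for a recognizer
# of three-digit strings over a 10-digit vocabulary.
DIGITS = 10

# Minimal coverage: each digit trained in initial, medial, and final position.
minimal_tokens = 3 * DIGITS          # 30 training contexts

# Optimal coverage: every ordered digit triple appears in the training data.
all_triples = DIGITS ** 3            # 1000 distinct three-digit strings

print(minimal_tokens, all_triples)   # the gap widens rapidly with vocabulary
```

For a vocabulary of V words, the corresponding figures are 3V versus V^3, which is why complete training of word-context models quickly becomes infeasible as V grows.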
Even if one could find a speaker who was willing to provide the necessary volume of training data, there would remain the problem of adapting to new speakers. As long as each possible utterance is regarded as independent of all others, training remains as time-consuming for new speakers as for the original speaker.
In recognizers that have a "front end" that attempts to recognize phonemes and a "back end" that recognizes words and sentences, it is natural to focus on phonemes in the training process. But even in a recognizer that makes no recognition decisions except the identity of the complete utterance that was spoken, it becomes essential, once the vocabulary is large, to carry out training at the level of phonemes.
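The appeal of phoneme-level training can be sketched as follows. In this hypothetical fragment (the names, the toy lexicon, and the string stand-ins for acoustic models are all illustrative assumptions, not the paper's implementation), a word model is assembled by concatenating shared phoneme models according to the word's phonetic spelling, so new words need no word-specific training tokens:

```python
# Hypothetical sketch: word models built by concatenating phoneme models.
# Strings stand in for trained acoustic models of each phoneme.
phoneme_models = {p: f"model<{p}>" for p in ["k", "ae", "t", "s"]}

# Toy phonetic dictionary mapping words to phoneme spellings.
lexicon = {
    "cat":  ["k", "ae", "t"],
    "cats": ["k", "ae", "t", "s"],
}

def word_model(word):
    """Concatenate the phoneme models given by the word's spelling."""
    return [phoneme_models[p] for p in lexicon[word]]

print(word_model("cats"))
```

Because every word shares the same small inventory of phoneme models, training (and speaker adaptation) need only cover the phonemes, not each word of the vocabulary.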
Cite as: Bamberg, P.G. (1990) Adaptable phoneme-based models for large-vocabulary speech recognition. Proc. ESCA Workshop on Speaker Characterization in Speech Technology, 1-9