A recognition system based on a reference library of synthetic phoneme prototypes is described. The phoneme templates are specified in terms of control parameters to a serial formant synthesiser. The vocabulary and grammar are described in a finite-state phoneme network. Each phoneme is divided into a number of substates representing transitions and steady-state regions. The parameters of the transition states are interpolated from the steady-state parameters. At each state, a 16-channel filter bank section is computed from the synthesis parameters. Dynamic adaptation to the speaker's voice source spectrum is performed during recognition. Without adaptation, the average recognition rate for ten male speakers was 88% on an isolated-word task using a 26-word vocabulary. Adding voice source adaptation raised the recognition rate to 96%. On a vocabulary of 3 connected digits, the adaptation technique improved the recognition rate for six male speakers from 87.7% to 92.8%. The improvement was largest for subjects with a low initial recognition rate, indicating the usefulness of the voice source adaptation technique for certain voices. Current work is directed towards speaker adaptation of phoneme parameters and modelling of the variability of the parameter dynamics at phoneme boundaries.
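The abstract states that transition-state parameters are interpolated from the steady-state parameters, without giving the interpolation rule. The sketch below assumes simple linear interpolation between two steady-state parameter sets. The function name `interpolate_transition`, the parameter names `F1` and `F2`, and the formant values are hypothetical, chosen for illustration only, and are not taken from the paper.

```python
from typing import Dict, List


def interpolate_transition(
    start: Dict[str, float], end: Dict[str, float], n_substates: int
) -> List[Dict[str, float]]:
    """Return n_substates parameter sets lying strictly between start and end.

    Assumption: linear interpolation. The paper may use a different rule.
    """
    frames = []
    for i in range(1, n_substates + 1):
        w = i / (n_substates + 1)  # weight of the end target, 0 < w < 1
        frames.append({k: (1 - w) * start[k] + w * end[k] for k in start})
    return frames


# Hypothetical steady-state formant targets in Hz (illustrative values only).
steady_a = {"F1": 700.0, "F2": 1200.0}
steady_i = {"F1": 300.0, "F2": 2200.0}

# Three transition substates between the two steady-state regions.
transition = interpolate_transition(steady_a, steady_i, 3)
for frame in transition:
    print(frame)
```

In a system of the kind described, each interpolated parameter set would then be passed to the synthesis model to compute the filter bank section for that substate; that step is not sketched here.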
Cite as: Blomberg, M. (1990) Adaptation to a speaker's voice in a speech recognition system based on synthetic phoneme references. Proc. ESCA Workshop on Speaker Characterization in Speech Technology, 58-65
@inproceedings{blomberg90_scst, author={Mats Blomberg}, title={{Adaptation to a speaker's voice in a speech recognition system based on synthetic phoneme references}}, year=1990, booktitle={Proc. ESCA Workshop on Speaker Characterization in Speech Technology}, pages={58--65} }