Knowledge of a speech recognizer's sensitivity to different speech production parameters can be used to improve the system or to predict its behaviour in a given application. In this report, a speech recognition system was tested using manipulated synthetic speech. A text-to-speech system was used to produce words containing the 9 Swedish long vowels in CVC context. A "normal" production of each word served as the reference template for the recognition system. The test set consisted of the same words, with the value of one control parameter at a time changed from its original setting. The mel cepstrum distance between the reference and the manipulated word was measured. Modifying the pitch, the voice source spectral slope and the first four formant frequencies had a large influence on the distance, while varying the formant bandwidths had only small effects. The relation between the individual formants differs from results of experiments using human listeners. The results indicate that the sensitivity to pitch and voice source spectrum variation will degrade the recognizer's performance in speaker-independent applications and during stress, and that some form of normalisation is needed.
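The distance measurement described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact distance formula or say whether time alignment was applied, so the code below assumes a frame-wise Euclidean distance averaged over time-aligned frames. The function name `mean_cepstral_distance` and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np


def mean_cepstral_distance(ref, test):
    """Mean frame-wise Euclidean distance between two cepstral sequences.

    ref, test: arrays of shape (n_frames, n_coeffs), assumed to be
    time-aligned (an assumption; the paper's alignment method is not stated).
    """
    ref = np.asarray(ref, dtype=float)
    test = np.asarray(test, dtype=float)
    if ref.shape != test.shape:
        raise ValueError("sequences must have the same shape; no alignment is done here")
    # Euclidean distance for each frame, then the average over all frames
    return float(np.mean(np.linalg.norm(ref - test, axis=1)))


# Toy data (made up for illustration): 3 frames, 2 cepstral coefficients
reference = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
# "Manipulated" version: every frame shifted by the same offset vector
manipulated = reference + np.array([3.0, 4.0])

distance = mean_cepstral_distance(reference, manipulated)
```

In an experiment of the kind described, such a function would be called once per manipulated word, with the reference template as the first argument, so that the distance can be plotted against the value of the control parameter being varied.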
Cite as: Crosnier, S., Blomberg, M., Elenius, K. (1989) Speech recognizer sensitivity to the variation of different control parameters in synthetic speech. Proc. Speech Input/Output Assessment and Speech Databases, Vol.2, 143-146
@inproceedings{crosnier89_sioa, author={Sabine Crosnier and Mats Blomberg and Kjell Elenius}, title={{Speech recognizer sensitivity to the variation of different control parameters in synthetic speech}}, year=1989, booktitle={Proc. Speech Input/Output Assessment and Speech Databases}, pages={Vol.2, 143-146} }