Third International Conference on Spoken Language Processing (ICSLP 94)
Formant-based synthetic speech is less robust in noise than natural speech and is often criticised as "robot-like". One reason may be the failure to model systematic spectral variation that is not crucial to phoneme identification. This study investigates how some aspects of consonantal context and stress contribute to this variation. CV sequences were embedded in carrier phrases to give quasi-meaningful English sentences. Vowels were /@ u i @/ or /V u i O/; /z/ or /r/ followed /u/; other consonants were either all /b/ or all /d/. Stress was on the second syllable (Set 1), or the first and third (Set 2). F2 and F3 frequencies were lower when consonants were /r/ and /b/ rather than /z/ and /d/, but "r-lowering" was not significant in Set 2, presumably because vowel quality, and hence tongue position, were more constrained by stress. R-lowering may spread to syllables which are not adjacent to /r/, typically across unstressed vowels and labial consonants. The measured differences were usually audible. Modelled in synthetic speech, they can increase the phonemic intelligibility of the speech in noise by about 15%.
Bibliographic reference. Hawkins, Sarah / Slater, Andrew (1994): "Spread of CV and V-to-V coarticulation in British English: implications for the intelligibility of synthetic speech", In ICSLP-1994, 57-60.