An English utterance was synthesized in four versions using sets of diphones produced under four different prosodic and contextual conditions. The synthesis used either accented diphones only or appropriately located accented and unaccented diphones, with each of these conditions repeated using neutral-context and differentiated-context diphones. The four versions were presented to two listener groups, one native English and one non-native, for paired-comparison acceptability judgements. The results show a massive preference for the stress- and context-differentiated condition. Both stress and context had a significant effect on acceptability judgements, but context differentiation raised acceptability more strongly than stress differentiation. The native listeners and the main sub-group of non-native listeners judged the stimuli in essentially the same way.
Cite as: Barry, W., Nielsen, C., Andersen, O. (2001) Must diphone synthesis be so unnatural? Proc. 7th European Conference on Speech Communication and Technology (Eurospeech 2001), 975-978, doi: 10.21437/Eurospeech.2001-259
@inproceedings{barry01_eurospeech,
  author={William Barry and Claus Nielsen and Ove Andersen},
  title={{Must diphone synthesis be so unnatural?}},
  year=2001,
  booktitle={Proc. 7th European Conference on Speech Communication and Technology (Eurospeech 2001)},
  pages={975--978},
  doi={10.21437/Eurospeech.2001-259}
}