In this paper we describe a method that uses machine learning algorithms to automatically detect and remedy lexical stress errors in unit selection synthesis. If unintended stress patterns can be detected after unit selection, based on features available in the unit database, it may be possible to modify the units during waveform synthesis to correct the errors and produce an acceptable stress pattern. Note that, unlike most concatenative TTS systems, the TTS system studied here typically performs no prosody modification on selected units.
We trained several machine learning algorithms (CART, AdaBoost+CART, SVM and MaxEnt) on acoustic measurements from natural utterances and their corresponding stress patterns. Our experimental results show that MaxEnt achieves the highest accuracy on natural stress pattern classification (83.3% of 3-syllable words and 88.7% of 4-syllable words correctly classified). Although precision is good when classifying natural stress patterns, models trained on natural utterances produce a large number of false alarms when applied to synthesized stress patterns.
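As a rough illustration of the classification setup described above, the following is a minimal sketch of a MaxEnt (multinomial logistic regression) classifier that maps a per-syllable feature vector to the index of the stressed syllable. It is not the authors' implementation. The single feature used here (a relative duration per syllable), the toy data generator `make_toy_data`, and the training settings are all assumptions made for illustration, and the data are synthetic rather than taken from any unit database.

```python
import math
import random


def softmax(scores):
    """Turn raw class scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(W, b, x):
    """Class probabilities under a linear MaxEnt model."""
    scores = [sum(w * v for w, v in zip(W[c], x)) + b[c] for c in range(len(W))]
    return softmax(scores)


def predict(W, b, x):
    """Index of the most probable class (here: the stressed syllable)."""
    p = predict_proba(W, b, x)
    return max(range(len(p)), key=p.__getitem__)


def train_maxent(X, y, n_classes, lr=0.5, epochs=200):
    """Fit weights by batch gradient descent on the cross-entropy loss."""
    n_feats = len(X[0])
    W = [[0.0] * n_feats for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        gW = [[0.0] * n_feats for _ in range(n_classes)]
        gb = [0.0] * n_classes
        for x, label in zip(X, y):
            p = predict_proba(W, b, x)
            for c in range(n_classes):
                err = p[c] - (1.0 if c == label else 0.0)
                gb[c] += err
                for j in range(n_feats):
                    gW[c][j] += err * x[j]
        for c in range(n_classes):
            b[c] -= lr * gb[c] / len(X)
            for j in range(n_feats):
                W[c][j] -= lr * gW[c][j] / len(X)
    return W, b


def make_toy_data(n_per_class, rng):
    """Hypothetical data: the stressed syllable gets a longer relative duration."""
    X, y = [], []
    for stressed in range(3):
        for _ in range(n_per_class):
            x = [(1.0 if s == stressed else 0.5) + rng.uniform(-0.1, 0.1)
                 for s in range(3)]
            X.append(x)
            y.append(stressed)
    return X, y


rng = random.Random(0)
X, y = make_toy_data(20, rng)
W, b = train_maxent(X, y, n_classes=3)
accuracy = sum(predict(W, b, x) == t for x, t in zip(X, y)) / len(y)
print(f"training accuracy on toy data: {accuracy:.2f}")
```

In a detection setting, the predicted stress position for a synthesized word would be compared with the lexicon's intended position, and a mismatch would be flagged as a candidate stress error. The toy data are cleanly separable, so this sketch says nothing about the false-alarm behaviour reported for synthesized speech.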
Results from a preference test showed that signal modifications based on false positives do little harm to the speech output, but also that listeners perceive little difference between the raw TTS outputs and the post-processed ones.
Index Terms: speech synthesis, unit selection, lexical stress
Cite as: Kim, Y.-J., Beutnagel, M.C. (2010) A study of lexical stress patterns in unit selection synthesis. Proc. 7th ISCA Workshop on Speech Synthesis (SSW 7), 371-376
@inproceedings{kim10_ssw, author={Yeon-Jun Kim and Mark C. Beutnagel}, title={{A study of lexical stress patterns in unit selection synthesis}}, year=2010, booktitle={Proc. 7th ISCA Workshop on Speech Synthesis (SSW 7)}, pages={371--376} }