Sixth International Conference on Spoken Language Processing
A series of experiments was performed to determine the extent to which voice-quality differences can be labelled automatically in a speech database. Using speech corpora in three different speaking styles from the same speaker as test material, hidden Markov models were trained to distinguish the prosodic and acoustic characteristics of each style, and were then used to re-label the voiced segments to produce a single, merged, labelled corpus. Perceptual tests of speech synthesised by concatenation using CHATR showed that both prosodic and voice-quality cues to stylistic variation (in this case emotion) can be detected and labelled by the trained models. However, speech synthesised from the original, separate databases was perceived as more expressive.
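The re-labelling step described above can be illustrated with a minimal sketch: one hidden Markov model per speaking style, with each voiced segment assigned the style whose model gives it the highest likelihood. Everything specific here is an assumption made for illustration and is not taken from the paper: the style names (`style_a`, `style_b`, `style_c`), the two-state discrete models with hand-picked rather than trained parameters, and the quantisation of each frame into one of three symbols.

```python
import math

def forward_log_likelihood(obs, start, trans, emit):
    """Return log P(obs | model) for a discrete HMM via the scaled forward algorithm.

    obs is a non-empty list of symbol indices; start, trans and emit are the
    initial, transition and emission probabilities of the model.
    """
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate the forward probabilities one frame ahead.
            alpha = [
                sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][obs[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot generate this sequence
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_lik

def relabel(segments, models):
    """Label each segment with the name of the best-scoring style model."""
    return [
        max(models, key=lambda name: forward_log_likelihood(obs, *models[name]))
        for obs in segments
    ]

# Hypothetical toy models: (start, transition, emission) per style.
# Symbols 0, 1, 2 stand for an assumed three-level quantised acoustic feature.
_START = [0.5, 0.5]
_TRANS = [[0.9, 0.1], [0.1, 0.9]]
MODELS = {
    "style_a": (_START, _TRANS, [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1]]),
    "style_b": (_START, _TRANS, [[0.1, 0.8, 0.1], [0.2, 0.6, 0.2]]),
    "style_c": (_START, _TRANS, [[0.1, 0.1, 0.8], [0.1, 0.3, 0.6]]),
}

# Hypothetical voiced segments from a merged corpus, one symbol per frame.
SEGMENTS = [[0, 0, 0, 1], [1, 1, 2, 1], [2, 2, 2, 2]]

if __name__ == "__main__":
    for obs, label in zip(SEGMENTS, relabel(SEGMENTS, MODELS)):
        print(obs, "->", label)
```

In a real system the models would be trained on continuous prosodic and acoustic features from each style's corpus; the sketch only shows the maximum-likelihood decision that assigns a style label to each segment.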
Bibliographic reference. Campbell, Nick / Marumoto, Toru (2000): "Automatic labelling of voice-quality in speech databases for synthesis", In ICSLP-2000, vol.4, 468-471.