Sixth International Conference on Spoken Language Processing
(ICSLP 2000)

Beijing, China
October 16-20, 2000

Automatic Labelling of Voice-Quality in Speech Databases for Synthesis

Nick Campbell (1), Toru Marumoto (2)

(1) ATR Spoken Language Translation Research Labs., Seika-cho, Soraku-gun, Kyoto, Japan
(2) Nara Institute of Science and Technology, Japan

A series of experiments was performed to determine the extent to which voice-quality differences could be labelled automatically in a speech database. Using speech corpora of three different speaking styles from the same speaker as test material, hidden Markov models were trained to distinguish the prosodic and acoustic characteristics of each style, and were then used to re-label the voiced segments in order to provide a single, merged, labelled corpus. Perceptual tests of speech synthesised by concatenation using CHATR showed that both prosodic and voice-quality cues to stylistic variation (in this case emotion) can be detected and labelled by the trained models. However, speech synthesised from the original separate databases was perceived as being more expressive.


Bibliographic reference. Campbell, Nick / Marumoto, Toru (2000): "Automatic labelling of voice-quality in speech databases for synthesis", in ICSLP-2000, vol. 4, 468-471.