A six-dimensional label set for annotating the expressiveness of speech samples is proposed. Unlike conventional emotion labels, which require annotators to make rather difficult judgments about a speaker's high-level emotional state, the new set consists of six low-level labels ("pitch", "vocal effort", "voice age", "loudness", "speaking rate", and "speaking manner") that non-experts can apply more easily. A total of 800 expressive utterances were annotated by four annotators with the proposed labels, and the labeling shows good consistency (71%) among the annotators. The six labels capture the different styles (expressiveness) in the audio-book well. The difference between styles, measured by the intensity of styles along the six labels, is highly correlated (0.85) with the perceptual distance obtained from a subjective AB test. A compact classification and regression tree (CART) is built to automatically group sentences of similar expressiveness into several "pure" speaking styles. The interpretation of each speaking style can be read directly from the CART structure.
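As a rough illustration of the comparison described above, the following sketch computes a distance between styles from their mean intensities along the six labels and correlates it with perceptual distances. The abstract does not state which distance measure was used, so the Euclidean distance here is an assumption, and the style names, intensity values, and perceptual distances are made-up toy numbers, not data from the paper.

```python
from itertools import combinations
from math import sqrt

# The six low-level labels named in the abstract.
LABELS = ["pitch", "vocal effort", "voice age",
          "loudness", "speaking rate", "speaking manner"]


def style_distance(a, b):
    """Euclidean distance between two styles' label intensities (assumed metric)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical mean intensities per style, one value per label in LABELS order.
styles = {
    "style_a": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "style_b": [1.0, 1.0, 0.0, 1.0, 0.0, 0.0],
    "style_c": [-1.0, 0.0, 1.0, -1.0, -1.0, 1.0],
}

# Hypothetical perceptual distances, standing in for AB-test results.
perceptual = {
    ("style_a", "style_b"): 0.4,
    ("style_a", "style_c"): 0.6,
    ("style_b", "style_c"): 0.9,
}

pairs = list(combinations(sorted(styles), 2))
label_dists = [style_distance(styles[p], styles[q]) for p, q in pairs]
percept_dists = [perceptual[pair] for pair in pairs]

r = pearson(label_dists, percept_dists)
print(f"correlation between label-based and perceptual distance: {r:.2f}")
```

With real annotations, each style vector would be the average of the annotators' label values over the utterances assigned to that style, and the perceptual distances would come from the listening test.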
Cite as: Wang, L., Chu, M., Peng, Y., Zhao, Y., Soong, F.K. (2007) Perceptual annotation of expressive speech. Proc. 6th ISCA Workshop on Speech Synthesis (SSW 6), 46-51
@inproceedings{wang07_ssw,
  author={Lijuan Wang and Min Chu and Yaya Peng and Yong Zhao and Frank K. Soong},
  title={{Perceptual annotation of expressive speech}},
  year=2007,
  booktitle={Proc. 6th ISCA Workshop on Speech Synthesis (SSW 6)},
  pages={46--51}
}