Speech prosody encodes information about language and communicative intent as well as speaker identity and state. Consequently, a host of speech technologies could benefit from an increased understanding of prosodic phenomena and their corresponding acoustics. A recently developed comprehensive prosodic transcription system called RaP (Rhythm-and-Pitch) annotates both perceived rhythmic prominences and pitch tones in speech. Using RaP-annotated speech corpora, the present work analyzes relationships between perceived prosodic events and acoustic features, including syllable duration and novel measures of intensity and fundamental frequency. Canonical Correlation Analysis (CCA) reveals two dominant prosodic dimensions relating the acoustic features and RaP annotations. The first captures perceived prosodic emphasis of syllables, indicated by strong metrical beats and significant pitch variability (i.e., the presence of either high or low pitch tones). Acoustically, this dimension is described most strongly by syllable duration, followed by the mean intensity and fundamental frequency measures. The second CCA dimension primarily discriminates pitch tone level (high versus low), indicated mainly by the mean fundamental frequency measure. Finally, within a leave-one-out cross-validation framework, RaP prosodic events are well predicted from acoustic features (AUC between 0.78 and 0.84). Future work will exploit automated RaP labelling in contexts ranging from language learning to neurological disorder recognition.
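As a rough illustration of the CCA step described in the abstract, the following sketch computes canonical correlations between a matrix of per-syllable acoustic features and a matrix of binary annotation indicators. It is not the authors' implementation: the data are synthetic, the three feature columns (duration, intensity, F0) and three label columns (beat, high tone, low tone) are simplified stand-ins for the paper's measures and the RaP label set, and the function name `cca` and the small regularization term `reg` are assumptions made for this example.

```python
import numpy as np


def cca(X, Y, reg=1e-8):
    """Canonical correlations and projection weights for X (n x p), Y (n x q)."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Sample covariance blocks; reg keeps the Cholesky factorization stable.
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    Lx = np.linalg.cholesky(Sxx)  # Sxx = Lx Lx^T
    Ly = np.linalg.cholesky(Syy)  # Syy = Ly Ly^T
    # Whitened cross-covariance: M = Lx^{-1} Sxy Ly^{-T}
    M = np.linalg.solve(Lx, Sxy)
    M = np.linalg.solve(Ly, M.T).T
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    A = np.linalg.solve(Lx.T, U)     # weights applied to centered X
    B = np.linalg.solve(Ly.T, Vt.T)  # weights applied to centered Y
    return s, A, B


# Synthetic stand-in data (NOT from the RaP corpora).
rng = np.random.default_rng(0)
n = 500
beat = rng.random(n) < 0.4
tone = rng.choice(3, size=n, p=[0.6, 0.2, 0.2])  # 0 = none, 1 = high, 2 = low
high, low = tone == 1, tone == 2
any_tone = high | low

# Hypothetical z-scored acoustic features with assumed effect sizes.
duration = 1.0 * beat + 0.5 * any_tone + rng.normal(size=n)
intensity = 0.6 * beat + 0.3 * any_tone + rng.normal(size=n)
f0 = 0.8 * high - 0.8 * low + rng.normal(size=n)

X = np.column_stack([duration, intensity, f0])
Y = np.column_stack([beat, high, low]).astype(float)

s, A, B = cca(X, Y)
print("canonical correlations:", np.round(s, 3))
print("acoustic weights, first dimension:", np.round(A[:, 0], 3))
```

Each column of `A` and `B` defines one canonical dimension, and inspecting the weights shows which acoustic features and which annotation labels contribute most to it. The sign of each dimension is arbitrary, so only the relative magnitudes and sign patterns within a column are meaningful.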
Cite as: Godoy, E., Williamson, J.R., Quatieri, T.F. (2017) Canonical Correlation Analysis and Prediction of Perceived Rhythmic Prominences and Pitch Tones in Speech. Proc. Interspeech 2017, 3206-3210, doi: 10.21437/Interspeech.2017-1585
@inproceedings{godoy17_interspeech,
  author={Elizabeth Godoy and James R. Williamson and Thomas F. Quatieri},
  title={{Canonical Correlation Analysis and Prediction of Perceived Rhythmic Prominences and Pitch Tones in Speech}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={3206--3210},
  doi={10.21437/Interspeech.2017-1585}
}