This work presents a study on the suitability of prosodic and acoustic features, with a special focus on i-vectors, for expressive speech analysis and synthesis. Several prosodic and acoustic features are extracted for each utterance of two different databases: laboratory-recorded acted emotional speech and an audiobook. Among these features, i-vectors are built not only on MFCCs, but also on F0, power, and syllable durations. Unsupervised clustering is then performed using different feature combinations, and the resulting clusters are evaluated by calculating the cluster entropy on labeled portions of the databases. Additionally, synthetic voices are trained from the audiobook clusters using speaker adaptive training. The voices are evaluated in a perceptual test in which participants edit an audiobook paragraph using the synthetic voices. The objective results suggest that i-vectors are very useful for the audiobook, where different speakers (book characters) are imitated, whereas for the laboratory recordings traditional prosodic features outperform i-vectors. A closer analysis of the resulting clusters also suggests that different speakers use different prosodic and acoustic means to convey emotions. The perceptual results suggest that the proposed i-vector-based feature combinations can be used for audiobook clustering and voice training.
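
To make the evaluation step concrete, the sketch below shows one plausible way to cluster per-utterance feature vectors and score the result with a size-weighted cluster entropy against emotion labels. This is a minimal illustration, not the authors' implementation: the choice of k-means, the feature matrix X, and the label array y are assumptions introduced here for the example.

# Minimal sketch (assumed, not from the paper): unsupervised clustering of
# per-utterance features and cluster-entropy evaluation on labeled data.
import numpy as np
from sklearn.cluster import KMeans

def cluster_entropy(cluster_ids, labels):
    """Size-weighted mean entropy of the label distribution within each cluster.

    Lower values indicate purer clusters (labels are less mixed inside a cluster).
    """
    total = len(labels)
    entropy = 0.0
    for c in np.unique(cluster_ids):
        cluster_labels = labels[cluster_ids == c]
        _, counts = np.unique(cluster_labels, return_counts=True)
        p = counts / counts.sum()
        entropy += (len(cluster_labels) / total) * (-(p * np.log2(p)).sum())
    return entropy

# Hypothetical usage: X is an (n_utterances x dim) matrix of features
# (e.g. i-vectors and/or prosodic features), y holds emotion labels for the
# labeled subset. Placeholder data is used here purely for illustration.
X = np.random.randn(200, 20)
y = np.random.randint(0, 4, size=200)
cluster_ids = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print("cluster entropy:", cluster_entropy(cluster_ids, y))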
Cite as: Jauk, I., Bonafonte, A. (2016) Prosodic and Spectral iVectors for Expressive Speech Synthesis. Proc. 9th ISCA Workshop on Speech Synthesis Workshop (SSW 9), 59-63, doi: 10.21437/SSW.2016-10
@inproceedings{jauk16_ssw,
  author={Igor Jauk and Antonio Bonafonte},
  title={{Prosodic and Spectral iVectors for Expressive Speech Synthesis}},
  year=2016,
  booktitle={Proc. 9th ISCA Workshop on Speech Synthesis Workshop (SSW 9)},
  pages={59--63},
  doi={10.21437/SSW.2016-10}
}