We present a novel methodology for speech prosody research based on the analysis of embeddings used to condition a convolutional WaveNet speech synthesis system. The methodology is evaluated using a corpus of Lombard speech, pre-processed in order to preserve only prosodic characteristics of the original recordings. The conditioning embeddings are trained to represent the combined influences of three sources of prosodic variation present in the corpus: the level and type of ambient noise, and the sentence focus type. We show that the resulting representations can be used to quantify the prosodic effects of the underlying influences, as well as interactions among them, in a statistically robust way. Comparing the results of our analysis with the results of a more traditional examination indicates that the presented methodology can be used as an alternative method of phonetic analysis of prosodic phenomena.
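As a rough illustration of how condition-level embeddings could be compared statistically, the sketch below measures the distance between the mean embeddings of two conditions and estimates its significance with a permutation test. This is not the authors' procedure: the abstract does not specify the statistical analysis, so the distance measure, the permutation test, the function names, and the randomly generated toy embeddings are all assumptions made for illustration only.

```python
import numpy as np

def centroid_distance(emb, labels, a, b):
    """Euclidean distance between the mean embeddings of conditions a and b."""
    mean_a = emb[labels == a].mean(axis=0)
    mean_b = emb[labels == b].mean(axis=0)
    return float(np.linalg.norm(mean_a - mean_b))

def permutation_p_value(emb, labels, a, b, n_perm=1000, seed=0):
    """Permutation test (illustrative assumption, not the paper's method).

    Shuffles the condition labels and counts how often the shuffled
    centroid distance is at least as large as the observed one.
    """
    rng = np.random.default_rng(seed)
    mask = (labels == a) | (labels == b)
    sub_emb, sub_labels = emb[mask], labels[mask]
    observed = centroid_distance(sub_emb, sub_labels, a, b)
    count = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(sub_labels)
        if centroid_distance(sub_emb, shuffled, a, b) >= observed:
            count += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (count + 1) / (n_perm + 1)

# Toy stand-in embeddings (synthetic, NOT data from the paper):
# 50 utterances per condition, 8 dimensions, with a shifted mean for "noise".
toy_rng = np.random.default_rng(42)
quiet = toy_rng.normal(0.0, 1.0, size=(50, 8))
noisy = toy_rng.normal(3.0, 1.0, size=(50, 8))
emb = np.vstack([quiet, noisy])
labels = np.array(["quiet"] * 50 + ["noise"] * 50)

observed, p = permutation_p_value(emb, labels, "quiet", "noise")
print(f"centroid distance = {observed:.2f}, permutation p = {p:.4f}")
```

The same comparison could in principle be repeated for each pair of noise levels, noise types, or focus types; interactions among the factors would need a multi-factor analysis that this sketch does not attempt.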
Cite as: Šimko, J., Vainio, M., Suni, A. (2020) Analysis of speech prosody using WaveNet embeddings: The Lombard effect. Proc. Speech Prosody 2020, 910-914, doi: 10.21437/SpeechProsody.2020-186
@inproceedings{simko20_speechprosody,
  author={Juraj Šimko and Martti Vainio and Antti Suni},
  title={{Analysis of speech prosody using WaveNet embeddings: The Lombard effect}},
  year=2020,
  booktitle={Proc. Speech Prosody 2020},
  pages={910--914},
  doi={10.21437/SpeechProsody.2020-186}
}