Listening to even high-quality text-to-speech, such as that generated by a Deep Neural Network (DNN) driving a vocoder, still requires greater cognitive effort than listening to natural speech under noisy conditions. Vocoding itself, together with errors in the DNN model's predictions of the vocoder speech parameters, is assumed to be responsible. To better understand the contribution of each parameter, we construct a range of systems that vary from copy-synthesis (i.e., vocoding) to full text-to-speech generated using a DNN system. Each system combines some speech parameters (e.g., spectral envelope) taken from copy-synthesis with other speech parameters (e.g., F0) predicted from text. Cognitive load was measured using a pupillometry paradigm described in our previous work. Our results reveal the differing contributions that each predicted speech parameter makes to increased cognitive load.
Cite as: Govender, A., Valentini-Botinhao, C., King, S. (2019) Measuring the contribution to cognitive load of each predicted vocoder speech parameter in DNN-based speech synthesis. Proc. 10th ISCA Workshop on Speech Synthesis (SSW 10), 121-126, doi: 10.21437/SSW.2019-22
@inproceedings{govender19_ssw,
  author={Avashna Govender and Cassia Valentini-Botinhao and Simon King},
  title={{Measuring the contribution to cognitive load of each predicted vocoder speech parameter in DNN-based speech synthesis}},
  year={2019},
  booktitle={Proc. 10th ISCA Workshop on Speech Synthesis (SSW 10)},
  pages={121--126},
  doi={10.21437/SSW.2019-22}
}