A Hierarchical Predictor of Synthetic Speech Naturalness Using Neural Networks

Takenori Yoshimura, Gustav Eje Henter, Oliver Watts, Mirjam Wester, Junichi Yamagishi, Keiichi Tokuda


A problem when developing and tuning speech synthesis systems is that there is no well-established method for automatically rating the quality of synthetic speech. This research attempts to obtain a new automated measure that is trained on the results of large-scale subjective evaluations employing many human listeners, namely the Blizzard Challenge. To exploit the data, we experiment with linear regression, feed-forward and convolutional neural network models, and combinations of them to regress from synthetic speech to the perceptual scores obtained from listeners. The biggest improvements were seen when combining stimulus- and system-level predictions.
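As a rough illustration of the regression setup described above, the following sketch fits a linear regressor from per-stimulus feature vectors to listener scores and then averages the stimulus-level predictions within each system to obtain a system-level score. It is not the paper's model: the features, the toy data, the averaging rule, and all names (`fit_linear`, `predict_linear`, `system_level_scores`) are assumptions made for the example, and the paper's actual combination of stimulus- and system-level predictions may differ.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with a bias term; returns the weight vector."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict_linear(w, X):
    """Apply the fitted weights (bias is the last entry of w)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w

def system_level_scores(stimulus_pred, system_ids):
    """Average the stimulus-level predictions within each system."""
    systems = np.unique(system_ids)
    means = np.array([stimulus_pred[system_ids == s].mean() for s in systems])
    return systems, means

# Toy data (assumed): 4 systems, 25 stimuli each, 5 acoustic features per stimulus.
rng = np.random.default_rng(0)
n_systems, per_system, n_feats = 4, 25, 5
system_ids = np.repeat(np.arange(n_systems), per_system)
X = rng.normal(size=(n_systems * per_system, n_feats))
true_w = rng.normal(size=n_feats)
y = X @ true_w + 3.0 + 0.1 * rng.normal(size=X.shape[0])  # stand-in for listener scores

w = fit_linear(X, y)
stim_pred = predict_linear(w, X)                            # stimulus-level predictions
systems, sys_pred = system_level_scores(stim_pred, system_ids)  # system-level predictions
```

In practice the toy features would be replaced by acoustic features extracted from the synthetic speech, and the targets by the listeners' scores from the subjective evaluation.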


DOI: 10.21437/Interspeech.2016-847

Cite as

Yoshimura, T., Henter, G.E., Watts, O., Wester, M., Yamagishi, J., Tokuda, K. (2016) A Hierarchical Predictor of Synthetic Speech Naturalness Using Neural Networks. Proc. Interspeech 2016, 342-346.

BibTeX
@inproceedings{Yoshimura+2016,
author={Takenori Yoshimura and Gustav Eje Henter and Oliver Watts and Mirjam Wester and Junichi Yamagishi and Keiichi Tokuda},
title={A Hierarchical Predictor of Synthetic Speech Naturalness Using Neural Networks},
year=2016,
booktitle={Interspeech 2016},
doi={10.21437/Interspeech.2016-847},
url={http://dx.doi.org/10.21437/Interspeech.2016-847},
pages={342--346}
}