Voice Quality Control Using Perceptual Expressions for Statistical Parametric Speech Synthesis Based on Cluster Adaptive Training

Yamato Ohtani, Koichiro Mori, Masahiro Morita


This paper describes a novel voice quality control method for synthetic speech using cluster adaptive training (CAT). In this method, we model voice quality factors labeled with perceptual expressions such as “Gender,” “Age,” and “Brightness.” In advance, we obtain intensity scores for the perceptual expressions by conducting a listening test that evaluates differences in voice quality between synthetic speech from the average voice model and that of the target speaker. We then build perceptual expression (PE) clusters, which we call PE models (PEMs), under the conditions that the average voice model is used as the bias cluster and the PE intensity scores are employed as the CAT weights. In synthesis, we can generate controlled synthetic speech through a linear combination of the PEMs and the existing speaker’s model. Subjective evaluation results demonstrate that the proposed method can control voice quality via PEs in many cases, and that target synthetic speech modified with PEMs achieves comparatively good speech quality.
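The synthesis step described above reduces to a CAT-style linear combination: a bias cluster (the average voice model) plus PE clusters scaled by their intensity scores. The following is a minimal sketch of that combination; all names are illustrative, and the toy vectors stand in for the acoustic-model parameter clusters used in the actual paper.

```python
# Hedged sketch of the CAT-style linear combination described in the
# abstract. The vectors below are toy stand-ins for acoustic-model
# parameter clusters; names are illustrative, not from the paper.

def cat_mean(bias, clusters, weights):
    """Combine a bias cluster with weighted PE clusters.

    bias     : mean vector of the average-voice (bias) cluster
    clusters : list of PE cluster mean vectors (one per expression,
               e.g. "Gender", "Age", "Brightness")
    weights  : CAT weights; here, the perceptual-expression
               intensity scores obtained from the listening test
    """
    assert len(clusters) == len(weights)
    combined = list(bias)
    for cluster, w in zip(clusters, weights):
        for d, value in enumerate(cluster):
            combined[d] += w * value
    return combined

# Toy example: 3-dimensional "mean vectors", two PE clusters.
bias = [0.0, 1.0, 2.0]
pe_clusters = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
pe_scores = [0.5, -0.25]  # intensity scores act as CAT weights
print(cat_mean(bias, pe_clusters, pe_scores))  # [0.5, 0.75, 2.0]
```

Varying the scores (weights) moves the combined model continuously along each perceptual axis, which is what enables the voice quality control the paper evaluates.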


DOI: 10.21437/Interspeech.2016-290

Cite as

Ohtani, Y., Mori, K., Morita, M. (2016) Voice Quality Control Using Perceptual Expressions for Statistical Parametric Speech Synthesis Based on Cluster Adaptive Training. Proc. Interspeech 2016, 2258-2262.

Bibtex
@inproceedings{Ohtani+2016,
  author={Yamato Ohtani and Koichiro Mori and Masahiro Morita},
  title={Voice Quality Control Using Perceptual Expressions for Statistical Parametric Speech Synthesis Based on Cluster Adaptive Training},
  year={2016},
  booktitle={Interspeech 2016},
  doi={10.21437/Interspeech.2016-290},
  url={http://dx.doi.org/10.21437/Interspeech.2016-290},
  pages={2258--2262}
}