Tone modeling using Gaussian process latent variable model for statistical speech synthesis

Decha Moungsri, Tomoki Koriyama, Takao Kobayashi


In continuous Thai speech, tone pronunciation is affected by several factors. One significant factor is stress, which causes diversity in the F0 contours of tones and also affects syllable durations. Our previous studies have shown that a stressed/unstressed syllable context improves tone modeling accuracy. However, stress in Thai is generally unknown for a given input text, and it varies widely in degree. Thus, a simple stressed/unstressed context is not enough to represent the intensity of stress. In this study, we introduce an unsupervised dimensionality reduction technique, the variational Gaussian process latent variable model (GP-LVM), to represent the diversity of stress. The stress-related information, F0 contour and duration, is projected onto a latent space of lower dimensionality than the original to represent the degree of stress. Then, we use data points in the latent space as a context in a Gaussian process regression (GPR)-based speech synthesis framework, which allows us to determine the similarity of contextual factors continuously using a kernel function. We examine two approaches to data projection: single-space projection and separated-space projection. Objective and subjective evaluation results show that the proposed technique achieves an improvement in tone modeling.
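As a rough illustration of why a continuous latent context can carry more information than a binary stressed/unstressed label, the sketch below compares the two kinds of similarity. It is not the authors' implementation: the squared-exponential kernel, the function names, the 2-D latent dimensionality, and the latent coordinates are all assumptions chosen for illustration, and the GP-LVM projection itself is not performed here.

```python
import math

def rbf_kernel(x, y, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between two latent points (assumed kernel form)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return variance * math.exp(-0.5 * sq_dist / lengthscale ** 2)

def binary_kernel(label_a, label_b):
    """Similarity under a plain stressed/unstressed label: all or nothing."""
    return 1.0 if label_a == label_b else 0.0

# Hypothetical 2-D latent coordinates for three syllables; in the paper these
# would come from projecting F0 contour and duration with variational GP-LVM.
strong = (1.8, 0.4)
moderate = (1.1, 0.2)
weak = (-1.5, -0.3)

# Continuous context: similarity decays smoothly with distance in latent space.
k_strong_moderate = rbf_kernel(strong, moderate)
k_strong_weak = rbf_kernel(strong, weak)

# Binary context: strong and moderate stress collapse into the same label.
b_strong_moderate = binary_kernel("stressed", "stressed")
b_strong_weak = binary_kernel("stressed", "unstressed")

print(k_strong_moderate, k_strong_weak)
print(b_strong_moderate, b_strong_weak)
```

With the binary label, strongly and moderately stressed syllables are indistinguishable, whereas the kernel over latent points gives them a graded similarity that is still clearly higher than the similarity between strongly and weakly stressed syllables.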


DOI: 10.21437/SpeechProsody.2016-208

Cite as

Moungsri, D., Koriyama, T., Kobayashi, T. (2016) Tone modeling using Gaussian process latent variable model for statistical speech synthesis. Proc. Speech Prosody 2016, 1014-1018.

Bibtex
@inproceedings{Moungsri+2016,
author={Decha Moungsri and Tomoki Koriyama and Takao Kobayashi},
title={Tone modeling using Gaussian process latent variable model for statistical speech synthesis},
year=2016,
booktitle={Speech Prosody 2016},
doi={10.21437/SpeechProsody.2016-208},
url={http://dx.doi.org/10.21437/SpeechProsody.2016-208},
pages={1014--1018}
}