ISCA Archive ISCSLP 2004

A Superposed Prosodic Model for Chinese Text-To-Speech Synthesis

GaoPeng Chen, Gerard Bailly, QingFeng Liu, RenHua Wang

The paper presents the application of the trainable SFC superpositional prosodic model to Chinese. Within the SFC model, prosodic parameters (F0, syllabic lengthening) are interpreted as the superposition of overlapping multiparametric contours. These contours are associated with high-level prosodic features operating at different scopes, such as tones, stress, prosodic boundaries, and the part of speech of words. Each feature label corresponds to a metalinguistic function (morphological, lexical, syntactic, attitudinal…) which is represented by a neural network. The observed contour is the sum of the outputs of the corresponding neural networks. An analysis-by-synthesis scheme is implemented for automatic learning. The model works well in the concatenation of neighboring units. The RMSE of F0 prediction is 2.34 st (referenced to 200 Hz), and the correlation is 0.86. Perceptual experiments show that the predicted prosody is quite appropriate and fluent.

1 INTRODUCTION

The fundamental problem for intonation analysis and synthesis is that prosody is the acoustic encoding of a large number of linguistic and paralinguistic features. Two major classes of intonation models have evolved in the past two decades.

Superpositional models interpret prosody as complex patterns resulting from the superposition of simpler components. The Fujisaki model [5] is the typical model of this class; it decomposes F0 into a phrase component and an accent component. Its parameters are associated with the mechanism of pronunciation, which is quite relevant to macro-prosodic features. It has been tried on many languages, including Chinese [4, 9]. Because tonal and non-tonal languages have different characteristics, it is difficult to simulate tone events with accent components. Besides, the automatic extraction of the phrase and accent commands from observed F0 is not a solved problem. Other proposals [1, 6, 11] also face the ill-posed problem of analysis, i.e.
decomposing an observed contour into elementary contributions. The SFC [2, 7] implements a prosodic model initially proposed for French [1], which introduces a new model-constrained, data-driven method to generate prosodic contours with very few prototypical movements. The SFC introduces an original training paradigm using an analysis-by-synthesis framework that iteratively decomposes prosodic contours and builds the prosodic model at the same time (see §2).

On the other hand, there are models that claim that F0 contours are generated from a sequence of phonologically distinctive tones or categorically different pitch accents, which are determined locally. Typical examples are the Tilt model [10] for English and PENTA [12, 13] for Chinese. These models focus on local events, but they ignore the behavior of prosody over larger units, such as phrases or clauses.

Chinese is a tone language with high-level, low-rising, low-falling, high-falling and neutral tones. Tone events are very important to the prosody of an utterance. Each syllable, being the carrier of a tone and a basic meaningful phonetic unit, is normally an individual target of prediction. However, sentence declination and phrasing are important as well. In this paper, a superposed model is proposed for Chinese prosodic contours that takes into account the sequences of tones, phrases and clauses.

2 DESCRIPTION OF THE MODEL

Principles. SFC considers that the prosodic contour is the contribution of a few basic metalinguistic functions (phonetic, such as tonal distinctions, segmentation, salience, hierarchy…) acting on different units at various scopes.
We suppose that (1) each function affects prosody by means of a function-specific multiparametric contour called a functional contour (FC); (2) an FC is co-extensive with the units concerned by the function it implements: this extent is called the domain or the scope of the FC and is independent of the other units or functions involved in the discourse structure; (3) the shape of an FC is only a function of its scope (and of course of the metalinguistic function it implements); (4) the predicted/target contour is the superposition of the corresponding FCs using an appropriate scale (logarithmic for both F0 and syllabic lengthening).

Functional contour generators. All FCs implementing a given prosodic function are generated by a unique functional contour generator (FCG). FCGs are now

0-7803-8678-7/04/$20.00 ©2004 IEEE
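
As an illustration of principle (4), the following minimal sketch superposes a few functional contours on a logarithmic (semitone) scale and converts the sum back to Hz. It is not the authors' implementation: the contour values are invented, each contour is simplified to one F0 value per syllable, and the names `superpose`, `hz_to_st`, `st_to_hz` and `rmse` are hypothetical. In the SFC, each contour would be produced by a trained FCG rather than written by hand. The 200 Hz reference follows the one quoted in the abstract.

```python
import math

REF_HZ = 200.0  # semitone reference, as quoted in the abstract


def hz_to_st(f_hz, ref_hz=REF_HZ):
    """Convert a frequency in Hz to semitones relative to ref_hz."""
    return 12.0 * math.log2(f_hz / ref_hz)


def st_to_hz(st, ref_hz=REF_HZ):
    """Convert semitones relative to ref_hz back to Hz."""
    return ref_hz * 2.0 ** (st / 12.0)


def superpose(n_syllables, contours):
    """Sum functional contours over their scopes.

    Each contour is (start_index, values_in_semitones); its scope is the
    run of syllables starting at start_index (assumed simplification).
    """
    total = [0.0] * n_syllables
    for start, values in contours:
        for offset, value in enumerate(values):
            total[start + offset] += value
    return total


def rmse(predicted, observed):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical contours (values invented for illustration only):
declination = (0, [2.0, 1.0, 0.0, -1.0])  # scope: whole 4-syllable phrase
tone_a = (0, [3.0, 1.0])                  # scope: syllables 0-1
tone_b = (2, [-2.0, 0.5])                 # scope: syllables 2-3

f0_st = superpose(4, [declination, tone_a, tone_b])
f0_hz = [st_to_hz(v) for v in f0_st]
```

Because the contours are summed in semitones, adding them corresponds to multiplying frequency ratios in Hz, which is what a logarithmic scale for F0 implies. The `rmse` helper, applied to predicted and observed contours expressed in semitones, gives an error figure of the same kind as the one reported in the abstract.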

Cite as: Chen, G., Bailly, G., Liu, Q., Wang, R. (2004) A Superposed Prosodic Model for Chinese Text-To-Speech Synthesis. Proc. International Symposium on Chinese Spoken Language Processing, 177-180

  author={GaoPeng Chen and Gerard Bailly and QingFeng Liu and RenHua Wang},
  title={{A Superposed Prosodic Model for Chinese Text-To-Speech Synthesis}},
  booktitle={Proc. International Symposium on Chinese Spoken Language Processing},